references

All India Seminar on ICT for Rural Development: Access and Applications (ICTRD-2009) Held under the Computer Engineering Division, The Institutions of
Engineers (India), Local Chapter, Udaipur, 2009 ISBN: 978-81-7906-197-8
Detecting Local Outliers in High Dimensional Space
Dr. Dharm Singh1, Tarun Kulshreshtha2, Divya Saxena2, Ankur Agrawal3
1Department of Computer Science, College of Technology and Engineering
Maharana Pratap University of Agriculture and Technology, Udaipur
[email protected]
2 Department of Computer Science, G.L.A. Institute of Technology and Management, Mathura, India
[email protected]
3 Department of Computer Science, Indian Institute of Technology Bombay, Powai, Mumbai, Maharashtra,
India
[email protected]
Abstract – Outlier detection has received significant attention in
many applications, such as detecting credit card fraud or
network intrusions. Existing studies in outlier detection mostly
focus on detecting outliers in the full feature space. However, most
algorithms tend to break down in high-dimensional feature spaces
because classes of objects often exist in specific subspaces of the
original feature space. As a novel solution to tackle this problem,
we propose a preference dimension based outlier detection technique,
which uses a unique subspace for calculating the distance between
objects. Using this concept we adopt local density based outlier
detection to cope with high-dimensional data. A broad experimental
evaluation shows that this approach yields results of significantly
better quality than existing algorithms.
Keywords: Outlier detection, high-dimensional data
1. INTRODUCTION
Outlier mining is an active research field in data mining and
has a lot of practical applications in many different domains,
such as financial fraud detection [1], network intrusion detection
[2], medical abnormal reaction analysis [3], and signal
preprocessing [4].
Hawkins defined an outlier as follows: "An outlier is an
observation that deviates so much from other observations as to
arouse suspicion that it was generated by a different mechanism"
[5]. Based on this definition, researchers have proposed various
schemes for outlier mining. Statistical approaches were the
earliest ones used for outlier mining [6]. They assume that the
dataset follows a distribution, and an object having low probability
with respect to that distribution may be an outlier [7]-[8].
However, this approach requires prior knowledge of the underlying
distribution of the dataset to compute the outlier scores, which is
usually unknown. In order to overcome the limitations of the
statistical approaches, distance-based [9] and density-based [10]
approaches were introduced to detect outliers; they use the
k-nearest neighbors (KNN) of a point to compute the similarity
between data points.
Many useful outlier detection methods proposed in the last
decade compute the outlier score of data points in the complete
feature space, i.e. each dimension is weighted equally when
computing the distance between points. These approaches are
successful for low-dimensional data sets. However, in higher-dimensional
feature spaces their accuracy and efficiency deteriorate
significantly. The major reason for this behavior is the so-called
curse of dimensionality: in high-dimensional feature spaces, a
full-dimensional distance is often no longer meaningful, since the
nearest neighbor of a point is expected to be almost as far away as
its farthest neighbor [11].

Fig. 1. Outliers according to local density-based outlier detection (a) and
according to preference dimension based outlier detection (b).
In this paper, we use the concept of preference dimension to
solve this problem. Our new approach PDOF is founded on the
concept of local outlier factor proposed in density-based local
outlier detection [10]. In the example given in Fig. 1, a local
density based algorithm will detect only one outlier, o1, as shown
in Fig. 1(a), while a preference dimension based algorithm will
detect two outliers, o1 and o2, as shown in Fig. 1(b).
In order to ensure the quality of outlier detection in high-dimensional datasets, we suggest using only a subset of the full
feature space called preference dimension to compute outliers in
a subspace instead of using each dimension with equal
importance. Thus, we build a weight vector based on the
variance in each attribute and use a weighted Euclidean distance
measure based on this weight vector. Using this more flexible
model, we propose the algorithm PDOF (Preference Dimension
Outlier Factor) to efficiently compute exact solutions of the
subspace outlier detection problem. The user can select a
parameter Ξ² indicating the variance threshold for feature
selection in the preference dimension of the dataset. Only those
features with a variance of no more than Ξ² are included in the
preference dimension.
The remainder of the paper is organized as follows: In
Section 2, we discuss related work and point out our
contributions. In Section 3, we formalize our notion of
preference dimension and show how to efficiently detect subspace based outliers.
In Section 4, we discuss the time-complexity of our approach.
Section 5 contains an extensive experimental evaluation and
Section 6 concludes the paper.
2. RELATED WORK

2.1 LOF Algorithm

LOF (Local Outlier Factor) [10] is the first outlier notion that also
quantifies how outlying an object is. The LOF value of an object is
based on the average of the ratios of the local reachability density
of the area around the object and the local reachability densities of
its neighbors. The size of the neighborhood of the object is
determined by the area containing a user-supplied minimum number of
points (MinPts). The concepts and terms needed to explain the LOF
algorithm are defined as follows.

Definition 1: (k-distance of an object p) For any positive integer k,
the k-distance of object p, denoted as k-distance(p), is defined as
the distance d(p,o) between p and an object o of D such that:
(i) for at least k objects o' ∈ D\{p} it holds that d(p,o') ≀ d(p,o), and
(ii) for at most k βˆ’ 1 objects o' ∈ D\{p} it holds that d(p,o') < d(p,o).

Definition 2: (k-distance neighborhood of an object p) Given the
k-distance of p, the k-distance neighborhood of p contains every
object whose distance from p is not greater than the k-distance.

Definition 3: (reachability distance of an object p w.r.t. object o)
Let k be a natural number. The reachability distance of object p with
respect to object o is defined as

reach\text{-}dist_k(p, o) = \max\{\, k\text{-}distance(o),\ d(p, o) \,\}

Let MinPts be the only parameter and let the values
reach-dist_MinPts(p, o), for o ∈ N_MinPts(p), be a measure of the
volume used to determine the density in the neighborhood of an object p.

Definition 4: (local reachability density of an object p) The local
reachability density of p is defined as

lrd_{MinPts}(p) = 1 \Big/ \left[ \frac{\sum_{o \in N_{MinPts}(p)} reach\text{-}dist_{MinPts}(p, o)}{|N_{MinPts}(p)|} \right]

Definition 5: (local outlier factor of an object p) The local outlier
factor of p is defined as

LOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{lrd_{MinPts}(o)}{lrd_{MinPts}(p)}}{|N_{MinPts}(p)|}

It is easy to see that the lower lrd_MinPts(p) is, and the higher the
lrd_MinPts(o), o ∈ N_MinPts(p), are, the higher is the LOF value of p.
The outlier factor of object p captures the degree to which we call p
an outlier. However, the weakness of the LOF algorithm is that it
cannot detect outliers that exist only in a subspace of the full
feature space.

2.2 DSNOF Algorithm

The DSNOF algorithm [12] considers the density similarity between an
object and its neighbors and calculates the DSNOF value. In this
algorithm each object in the dataset is assigned a
density-similarity-neighbor based outlier factor (DSNOF) to indicate
its degree of outlier-ness. The algorithm calculates the densities of
an object and its neighbors and constructs the similar density series
(SDS) in the neighborhood of the object. Based on the SDS, the
algorithm computes the average series cost (ASC) of the objects.
Finally, it calculates the DSNOF of the object based on the ASC of the
object and its neighbors.

2.3 Subspace Outlier Detection in Data with Mixture of Variances and Noise

This algorithm [13] introduced a bottom-up approach to discover
clusters of outliers in any m-dimensional subspace of an n-dimensional
space. It uses an outlier score function based on the Chebyshev
(L∞ norm) distance in order to properly rank the outliers.

First, it computes an outlier score for every point in each dimension.
It states that if a point is an outlier in a subspace, its score must
be high in each dimension of that subspace. It then aggregates the
scores to compute the final outlier score for the points in the
dataset. It introduces a filter threshold to eliminate the
high-dimensional noise during the aggregation.
2.4 Our Contributions
In this paper, we make the following contributions. In order
to enhance the quality of density-based outlier detection
algorithms in high-dimensional space, we extend the well-founded
notion of density-based local outliers to ensure high quality
results even in high-dimensional spaces. We do not use any sampling
or approximation techniques, so the result of our outlier-mining
algorithm is deterministic. We propose an efficient method called
PDOF which is able to detect all subspace based outliers of a
certain dimensionality in a single scan over the
database and is linear in the number of dimensions. Finally, we
successfully apply our algorithm PDOF to real-world data, showing
its superior performance over existing approaches.
3. THE PROPOSED ALGORITHM (PDOF)
In this section, we formalize the notion of preference
dimension based outliers. Let D be a database of d-dimensional
points (D βŠ† R^d), where the set of attributes is denoted by
A = {A_1, ..., A_d}. The projection of a point p onto an attribute
A_i ∈ A is denoted by Ο€_{A_i}(p). Intuitively, a preference dimension based
outlier is an object whose density is low as compared to its
neighbors in a certain subspace of the full feature space.
In order to find meaningful attributes to form a preference
dimension, we select only those attributes which have low
variance in the complete dataset. However, some values of an
attribute may lie far from its mean (possibly for outliers). Such
values would prevent the variance calculation from capturing the
normal behavior of the dataset, so we need to eliminate them. For
this we use the standard deviation as a measure.
Definition 1 (controlled variance along an attribute)
Let g_i be the mean value and Οƒ_i the standard deviation of an
attribute A_i ∈ A. The controlled variance of the dataset along A_i,
denoted by Var_{A_i}(D), is defined as follows:

Var_{A_i}(D) = \frac{\sum_{p \in D \,:\, g_i - 3\sigma_i < \pi_{A_i}(p) < g_i + 3\sigma_i} dist(\pi_{A_i}(p), g_i)^2}{\left| \{ p \in D : g_i - 3\sigma_i < \pi_{A_i}(p) < g_i + 3\sigma_i \} \right|}
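As a minimal sketch (our illustration, not the paper's implementation), the controlled variance of one attribute can be computed as follows; the class and method names are our own, and dist(Β·,Β·) of Definition 1 is taken to be the absolute difference:

// Sketch of Definition 1: controlled variance of one attribute, ignoring
// values farther than 3 standard deviations from the mean.
// 'column' is assumed to hold the projections pi_Ai(p) of all points onto Ai.
public class ControlledVariance {
    public static double of(double[] column) {
        double mean = 0.0;
        for (double v : column) mean += v;
        mean /= column.length;

        double sq = 0.0;
        for (double v : column) sq += (v - mean) * (v - mean);
        double sigma = Math.sqrt(sq / column.length);

        double sum = 0.0;   // squared deviations of the kept values
        int kept = 0;       // |{p : g_i - 3*sigma < pi_Ai(p) < g_i + 3*sigma}|
        for (double v : column) {
            if (v > mean - 3 * sigma && v < mean + 3 * sigma) {
                sum += (v - mean) * (v - mean);
                kept++;
            }
        }
        return kept > 0 ? sum / kept : 0.0;
    }
}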
The intuition of our formalization is to include in the preference
dimension those attributes which have a low variance in the dataset.
Therefore, we use a weight vector w_p to calculate the distance
between two points in the preference dimension.
Definition 2 (preference dimension similarity measure)
Let β ∈ R be a user-defined threshold on the allowed variance. Let
w_p = (w_1, w_2, ..., w_d) be the so-called preference dimension
weight vector, where

w_i = \begin{cases} 0 & \text{if } Var_{A_i}(D) > \beta \\ 1 & \text{if } Var_{A_i}(D) \le \beta \end{cases}

The preference dimension similarity measure between two points
p, q ∈ D is defined as

dist_{pref}(p, q) = \sqrt{\sum_{i=1}^{d} w_i \cdot (\pi_{A_i}(p) - \pi_{A_i}(q))^2}

where w_i is the i-th component of w_p.
Let us note that the preference dimension similarity measure
dist_pref(p, q) is simply a weighted Euclidean distance. The
parameter Ξ² specifies the threshold for a low variance. Since we are
only interested in distinguishing dimensions with low variance from
all other dimensions, weighting the dimensions inversely proportional
to their variance is not useful. Thus, our weight vector has only two
possible values.
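The following sketch illustrates Definition 2; it assumes the controlled variances Var_{A_i}(D) have already been computed per attribute (for instance with the ControlledVariance routine sketched above), and all names are illustrative rather than the authors' code:

// Sketch of Definition 2: binary preference dimension weight vector and the
// weighted Euclidean distance restricted to the selected attributes.
public class PreferenceDistance {
    // variances[i] is the controlled variance Var_Ai(D); beta is the user threshold.
    public static int[] weightVector(double[] variances, double beta) {
        int[] w = new int[variances.length];
        for (int i = 0; i < variances.length; i++) {
            w[i] = (variances[i] <= beta) ? 1 : 0;
        }
        return w;
    }

    // dist_pref(p, q): weighted Euclidean distance using the weight vector w.
    public static double distPref(double[] p, double[] q, int[] w) {
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            sum += w[i] * (p[i] - q[i]) * (p[i] - q[i]);
        }
        return Math.sqrt(sum);
    }
}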
After calculating the preference dimension distance between data
objects, we now use this distance measure to determine the outlying
degree of each object, as in the LOF algorithm but with some
modifications.
Definition 3 (preference dimension k-distance of an object p) For any
positive integer k, the preference dimension k-distance of object p,
denoted as k-dist_pref(p), is defined as the distance dist_pref(p,o)
between p and an object o of D such that:
(i) for at least k objects o' ∈ D\{p} it holds that dist_pref(p,o') ≀ dist_pref(p,o), and
(ii) for at most k βˆ’ 1 objects o' ∈ D\{p} it holds that dist_pref(p,o') < dist_pref(p,o).
Definition 4 (preference dimension k-distance neighborhood
of an object p) Given the preference dimension k-distance of p,
the preference dimension k-distance neighborhood of p contains
every object whose preference dimension distance from p is not
greater than the preference dimension k-distance.
Definition 5 (preference dimension reachability distance of an object
p w.r.t. object o) Let k be a natural number. The preference
dimension reachability distance of object p with respect to object o
is defined as

prefreach\text{-}dist_k(p, o) = \max\{\, k\text{-}dist_{pref}(o),\ dist_{pref}(p, o) \,\}

Let MinPts be the only parameter and let the values
prefreach-dist_MinPts(p, o), for o ∈ N_MinPts(p), be a measure of the
volume used to determine the density in the neighborhood of an object p.
Definition 6 (preference dimension reachability density of
an object p) The preference dimension reachability density of p
is defined as
π‘π‘‘π‘Ÿπ‘‘π‘€π‘–π‘›π‘π‘‘π‘  (𝑝)
βˆ‘π‘œ ∈ NMinPts(p) π‘π‘Ÿπ‘’π‘“π‘Ÿπ‘’π‘Žπ‘β„Ž βˆ’ 𝑑𝑖𝑠𝑑𝑀𝑖𝑛𝑃𝑑𝑠 (𝑝, π‘œ)
= 1/ [
]
|𝑁𝑀𝑖𝑛𝑃𝑑𝑠 (𝑝)|
Definition 7 (preference dimension outlier factor of an
object p) The preference dimension outlier factor of p is defined
as
PDOF_{MinPts}(p) = \frac{\sum_{o \in N_{MinPts}(p)} \frac{pdrd_{MinPts}(o)}{pdrd_{MinPts}(p)}}{|N_{MinPts}(p)|}
The higher the values pdrd_MinPts(o), o ∈ N_MinPts(p), are relative
to pdrd_MinPts(p), the higher is the PDOF value of p. The outlier
factor of object p captures the degree to which we call p an
outlier. Thus, we can find the n outliers existing in a certain
subspace of the full feature space by taking the top n objects
ordered by their PDOF values.
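Putting Definitions 3-7 together, a minimal quadratic, index-free sketch of the PDOF computation could look as follows; it assumes a precomputed symmetric n x n matrix dist of preference dimension distances (e.g. filled with distPref above) and is our illustration, not the authors' implementation:

import java.util.Arrays;

// Sketch of Definitions 3-7: PDOF score of every point, given the matrix of
// preference dimension distances and the neighborhood size MinPts (MinPts < n).
public class PdofSketch {
    public static double[] pdof(final double[][] dist, int minPts) {
        int n = dist.length;
        double[] kDist = new double[n];   // preference dimension k-distance (Def. 3)
        int[][] nbrs = new int[n][];      // k-distance neighborhoods (Def. 4)

        for (int p = 0; p < n; p++) {
            final int fp = p;
            Integer[] order = new Integer[n];
            for (int i = 0; i < n; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(dist[fp][a], dist[fp][b]));
            kDist[p] = dist[p][order[minPts]];           // order[0] is p itself
            int count = 0;
            for (int i = 1; i < n && dist[p][order[i]] <= kDist[p]; i++) count++;
            nbrs[p] = new int[count];
            for (int i = 0; i < count; i++) nbrs[p][i] = order[i + 1];
        }

        double[] pdrd = new double[n];    // preference dimension reachability density (Def. 6)
        for (int p = 0; p < n; p++) {
            double sum = 0.0;
            for (int o : nbrs[p]) sum += Math.max(kDist[o], dist[p][o]);  // prefreach-dist (Def. 5)
            pdrd[p] = nbrs[p].length / sum;
        }

        double[] scores = new double[n];  // PDOF (Def. 7); higher means more outlying
        for (int p = 0; p < n; p++) {
            double sum = 0.0;
            for (int o : nbrs[p]) sum += pdrd[o] / pdrd[p];
            scores[p] = sum / nbrs[p].length;
        }
        return scores;
    }
}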
4. COMPLEXITY ANALYSIS
The overall worst-case time complexity of our algorithm based on a
sequential scan of the dataset is O(d · n2). Our algorithm first
calculates the controlled variance along each attribute, which takes
O(d · n) time. It then assigns the preference dimension weight
vector, which takes O(d) time. The preference dimension k-distance
neighborhoods of all points can be calculated in O(d · n2) time. The
preference dimension reachability distances for all neighbor pairs
can then be obtained in linear time, and the pdrd and PDOF values of
all points can be calculated in O(k · n) time. Thus the worst-case
complexity of our algorithm is O(d · n2), assuming no index structure
is used. As this complexity is acceptable in practice, our algorithm
can be used to detect outliers in high-dimensional databases.
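Summing the steps above under the stated assumptions (sequential scan, no index structure, fixed neighborhood size k = MinPts), the total cost is

T(n, d) = O(d \cdot n) + O(d) + O(d \cdot n^2) + O(k \cdot n) = O(d \cdot n^2)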
5. EXPERIMENTAL EVALUATION

We conducted all our experiments on a workstation with a Pentium 4
2.6 GHz processor and 1.5 GB of RAM. We implemented all algorithms in
Java. In order to test the ability of our algorithm to detect
outliers, we used the KDD CUP 99 Network Connections Data Set from
the UCI repository [14]. This database contains a standard set of
data to be audited, which includes a wide variety of intrusions
simulated in a military network environment. The task was to detect
the attack connections without any prior knowledge about the
properties of the network intrusion attacks. The detection is based
solely on the hypothesis that attack connections may behave
differently from normal network activities, which makes them
outliers. We created a test dataset of 1000 connections from the
original KDD dataset. Each record has 38 continuous attributes
representing the statistics of a connection and its associated
connection type, e.g. normal or buffer overflow attack. A very small
number of attack connections were randomly selected: there are 22
types of attacks, with sizes varying from 1 to 4, giving 43 attack
connections in total in the test dataset.

We used the LOF algorithm as a baseline to compare our results, as it
is a well-known outlier detection method. We ran both algorithms for
different values of the input parameters and found our algorithm
performing better in most of the cases. Table 1 summarizes the
results of running both the LOF and PDOF algorithms on the test
dataset with MinPts = 10 and Ξ² = 0.0005 for PDOF.

TABLE 1
RESULTS OF RUNNING LOF AND PDOF ALGORITHMS ON TEST DATASET

  K      LOF    PDOF
  20       8       7
  30      12      13
  40      14      15
  50      18      20
  60      21      23
  80      26      28
 100      30      30

In Table 1, K indicates the total number of outliers detected
(including correct and incorrect detections), and the entries under
LOF and PDOF indicate the actual number of intrusion attacks detected
by each algorithm. As the table shows, our PDOF algorithm performs
better than the LOF algorithm in most of the cases. In Fig. 3, we
plot the execution time of the two algorithms against the number of
data points.

Fig. 3. Execution time for LOF and PDOF algorithms.
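For completeness, a small sketch (our illustration, not the authors' evaluation code) of how the entries of Table 1 can be obtained: sort the connections by outlier score and count how many of the K highest-scored connections are labeled attacks.

import java.util.Arrays;

// Sketch: count how many of the K highest-scored connections are actual attacks.
public class TopKEvaluation {
    public static int attacksInTopK(double[] scores, boolean[] isAttack, int k) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));  // descending by score
        int hits = 0;
        for (int i = 0; i < k && i < order.length; i++) {
            if (isAttack[order[i]]) hits++;
        }
        return hits;
    }
}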
6. CONCLUSION
In this paper we proposed an enhancement of the LOF outlier
detection algorithm to cope with the curse of dimensionality by
introducing the notion of preference dimension, so that outliers
existing in a certain subspace of the full feature space are
detected. We have shown that our approach is superior to the
well-known LOF algorithm in detecting outliers in high-dimensional
databases. The proposed algorithm PDOF first selects the meaningful
attributes by limiting the allowed variance in the dataset; a
parameter Ξ² is used to limit this variance. Future work could aim to
determine the value of the parameter Ξ² automatically, for better
performance on any given data set.
REFERENCES
[1] Dianmin Yue, Xiaodan Wu, Yunfeng Wang, Yue Li, and Chao-Hsien Chu, β€œA Review of Data Mining-Based Financial Fraud Detection Research”, Proceedings of the 2007 International Conference on Wireless Communications, Networking and Mobile Computing, Shanghai, P. R. China, September 21-25, 2007, pp. 5514-5517.
[2] Jiong Zhang and Mohammad Zulkernine, β€œAnomaly Based Network Intrusion Detection with Unsupervised Outlier Detection”, Proceedings of the 2006 IEEE International Conference on Communications, Istanbul, Turkey, June 11-15, 2006, pp. 2388-2393.
[3] Vili Podgorelec, Marjan Heričko, and Ivan Rozman, β€œImproving Mining of Medical Data by Outliers Prediction”, Proceedings of the 18th IEEE International Symposium on Computer-Based Medical Systems, Trinity College Dublin, Ireland, June 23-24, 2005, pp. 91-96.
[4] Jari Näsi, Aki Sorsa, and Kauko Leiviskä, β€œSensor Validation and Outlier Detection Using Fuzzy Limits”, Proceedings of the 44th IEEE Conference on Decision and Control, and the European Control Conference, Seville, Spain, December 12-15, 2005, pp. 7828-7833.
[5] D. Hawkins, Identification of Outliers, Chapman and Hall, London, 1980.
[6] Victoria J. Hodge and Jim Austin, β€œA Survey of Outlier Detection Methodologies”, Artificial Intelligence Review, Kluwer Academic, Netherlands, 2004, pp. 85-126.
[7] E. Eskin, β€œAnomaly Detection over Noisy Data Using Learned Probability Distributions”, Proceedings of the 17th International Conference on Machine Learning, Stanford, CA, USA, June 29-July 2, 2000, pp. 255-262.
[8] K. Yamanishi and J. Takeuchi, β€œDiscovering Outlier Filtering Rules from Unlabeled Data: Combining a Supervised Learner with an Unsupervised Learner”, Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 26-29, 2001, pp. 389-394.
[9] Edwin M. Knorr and Raymond T. Ng, β€œAlgorithms for Mining Distance-Based Outliers in Large Datasets”, Proceedings of the 24th International Conference on Very Large Data Bases (VLDB '98), San Francisco, CA, USA, 1998, pp. 392-403.
[10] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, β€œLOF: Identifying Density-Based Local Outliers”, SIGMOD Record, 29(2):93-104, 2000.
[11] A. Hinneburg, C. C. Aggarwal, and D. A. Keim, β€œWhat is the Nearest Neighbor in High Dimensional Spaces?”, Proceedings of the 26th International Conference on Very Large Databases (VLDB '00), Cairo, Egypt, 2000.
[12] Hui Cao, Gangquan Si, Wenzhi Zhu, and Yanbin Zhang, β€œEnhancing Effectiveness of Density-Based Outlier Mining”, 2008 International Symposiums on Information Processing, 2008.
[13] M. Q. Nguyen, L. Mark, and E. Omiecinski, β€œSubspace Outlier Detection in Data with Mixture of Variances and Noise”, Technical Report GT-CS-08-11, Georgia Institute of Technology, Atlanta, GA, USA, 2008.
[14] C. B. D. Newman and C. Merz, UCI Repository of Machine Learning Databases, 1998.