Computationally Easy Outlier Detection via Projection Pursuit with Finitely Many Directions

Robert Serfling¹ and Satyaki Mazumder²
University of Texas at Dallas and Indian Institute of Science, Education and Research

December, 2012

¹ Department of Mathematics, University of Texas at Dallas, Richardson, Texas 75080-3021, USA. Email: [email protected]. Website: www.utdallas.edu/∼serfling. Support by NSF Grants DMS-0805786 and DMS-1106691 and NSA Grant H98230-08-1-0106 is gratefully acknowledged.
² Indian Institute of Science, Education and Research, Kolkata, India
Abstract
Outlier detection is fundamental to data analysis. Desirable properties are affine invariance,
robustness, low computational burden, and nonimposition of elliptical contours. However,
leading methods fail to possess all of these features. The Mahalanobis distance outlyingness
(MD) imposes elliptical contours. The projection outlyingness (SUP), powerfully involving
projections of the data onto all univariate directions, is highly intensive computationally.
Computationally easy variants using projection pursuit with but finitely many directions
have been introduced, but these fail to capture at once the other desired properties. Here
we develop a “robust Mahalanobis spatial outlyingness on projections” (RMSP) function
which indeed satisfies all four desired properties. Pre-transformation to a strong invariant
coordinate system yields affine invariance, “spatial trimming” yields robustness, and “spatial
Mahalanobis outlyingness” is used to obtain computational ease and smooth, unconstrained
contours. From empirical study using artificial and actual data, our findings are that SUP
is outclassed by MD and RMSP, that MD and RMSP are competitive, and that RMSP is
especially advantageous in describing the intermediate outlyingness structure when elliptical
contours are not assumed.
AMS 2000 Subject Classification: Primary 62H99. Secondary 62G99
Key words and phrases: Outlier detection; Projection pursuit; Nonparametric; Multivariate;
Affine invariance; Robustness.
1 Introduction
Outlier identification is fundamental to multivariate statistical analysis and data mining.
Using a selected outlyingness function defined on the sample or input space, “outliers” are
those points whose outlyingness value exceeds a specified threshold. Desirably, outlyingness
functions have the properties (i) robustness against the presence of outliers, (ii) weak affine
invariance (i.e., transformation to other coordinates should not affect relative "outlyingness"
rankings and comparisons), (iii) computational efficiency in any practical dimension, and
(iv) nonimposition of elliptical contours. We point out that, although elliptical contours
are attractive and in some cases justified, an outlyingness function that does not arbitrarily
impose them can more effectively identify which points belong to the layers of moderate
outlyingness.
The widely used Mahalanobis distance outlyingness (MD) satisfies (i)-(iii) but gives up
(iv). On the other hand, the projection outlyingness (SUP) based on the projection pursuit
approach satisfies (i), (ii), and (iv) but gives up (iii). Here we construct a modified projection
pursuit outlyingness (RMSP) that satisfies all of (i)-(iv).
Projection pursuit techniques, originally proposed and experimented with by Kruskal
(1969, 1972), are extremely powerful in multivariate data analysis. Related ideas occur in
Switzer (1970) and Switzer and Wright (1971). A key implementation is due to Friedman
and Tukey (1974). Recently, “projection depth” has received significant attention in the
literature (Liu, 1992; Zuo and Serfling, 2000b; Zuo, 2003). The corresponding projection
outlyingness (SUP) is a multivariate extension of univariate scaled deviation outlyingness
O(x) = |x−µ|/σ, where µ and σ are univariate measures of location and spread. Specifically,
for a multivariate data point x, the “projection outlyingness” is given by the supremum of
all the univariate outlyingness values of the projections of x onto lines (Liu, 1992; Zuo and
Serfling, 2000b; Zuo, 2003; Serfling, 2004; Dang and Serfling, 2010).
Let us make this precise. Let X have distribution F_X on R^d and, for any unit vector u = (u_1, . . . , u_d)' in R^d, let F_{u'X} denote the induced univariate distribution of u'X. With µ given by the Median, say ν, and σ by the MAD (median absolute deviation from the median), say η, the associated well-known projection outlyingness (SUP) is

O_P(x, F_X) = sup_{||u|| = 1} |u'x − ν(F_{u'X})| / η(F_{u'X}),   x ∈ R^d,   (1)
representing the worst case scaled deviation outlyingness of projections of x onto lines. For a
d-dimensional data set Xn = {X1 , . . . , Xn }, the sample version OP (x, Xn ) is affine invariant,
highly masking robust (Dang and Serfling, 2010), and does not impose ellipsoidal contours.
However, it is obviously highly computational.
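To fix ideas, the following is a minimal computational sketch of the sample version O_P(x, X_n) (Python with NumPy; an illustration, not the authors' implementation). The supremum in (1) is approximated by a maximum over a large number of random unit directions, which makes the computational burden explicit: the cost grows with the number of directions times the sample size.

import numpy as np

def sup_outlyingness(X, n_dir=100_000, rng=None):
    """Approximate sample projection outlyingness (SUP) for each row of X.

    X : (n, d) data matrix.  The supremum in (1) is approximated by a
    maximum over n_dir random unit directions, so this is only a sketch.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    U = rng.standard_normal((n_dir, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # unit directions
    out = np.zeros(n)
    for u in U:                                      # loop to limit memory use
        proj = X @ u
        med = np.median(proj)
        mad = np.median(np.abs(proj - med))
        if mad > 0:
            out = np.maximum(out, np.abs(proj - med) / mad)
    return out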
To overcome the computational burden, we develop an alternative projection pursuit
outlyingness entailing only finitely many projections. Implementation involves the following
key steps, which are described in detail in the sequel.
Step 1. With only finitely many projections, O_P(x, X_n) is no longer affine invariant, not even orthogonally invariant. However, if we first transform the data to X*_n = D(X_n)X_n using a strong invariant coordinate system (SICS) transformation D(X_n), then the quantity

(u'x* − ν(u'X*_n)) / η(u'X*_n)

for any u becomes invariant under affine transformation of the original data via X_n ↦ Y_n = AX_n + b for nonsingular d × d A and d-vector b (see Serfling, 2010). Consequently, to capture affine invariance for our method, a SICS transformation is applied to X_n before taking projections.
In order not to lose information on the “outliers”, we need the SICS transformation
D(Xn ) to be robust. This is accomplished by constructing D(Xn ) from a subset ZN
of N “inner” observations from Xn , selected in an affine invariant way, with N =
0.5n or N = 0.75n, for example. One possibility for ZN is the N inner observations
used in computing the well-known Fast-MCD sample scatter matrix (Rousseeuw and
Van Driessen, 1999, implemented in the R packages MASS, rrcov, and robustbase,
for example). However, this becomes computationally prohibitive as the dimension
d increases. Instead, we use a computationally attractive spatial trimming approach
(Mazumder and Serfling, 2013) that chooses the N observations with lowest values of
sample Mahalanobis spatial outlyingness.
Step 2. The computational burden is reduced by taking relatively few projections,
but enough to capture sufficient information. From both intuitive and computational
considerations, the use of deterministic directions strongly appeals. In particular, we
choose s = 5d directions
∆ = {ui , 1 ≤ i ≤ s}
approximately uniformly distributed on the unit sphere in Rd but lying on distinct
diameters. Fang and Wang (1994) provide convenient numerical algorithms for this
purpose.
Step 3. Projection pursuit is carried out. For each X*_i in X*_n and direction u_j in ∆, the scaled deviation outlyingness of the projected value u_j'X*_i is computed:

a_n(i, j) = (u_j'X*_i − ν(u_j'X*_n)) / η(u_j'X*_n),   1 ≤ j ≤ s, 1 ≤ i ≤ n,

with ν(u_j'X*_n) and η(u_j'X*_n) the sample median and sample MAD, respectively, of the projected data set u_j'X*_n. The original data set X_n of n d-vectors thus becomes replaced by a new data set

A_n = [a_n(i, j)]_{s×n}

of n s-vectors of projected scaled deviations. These contain the relevant "outlyingness information".
Step 4. Redundancy in A_n is eliminated by a PCA reduction. With J indexing the N inner observations of Step 1, the (robust) sample covariance matrix S_n of the columns of A_n indexed by J is formed. Then PCA based on S_n is used to transform and reduce the n s-vectors of A_n to n t-vectors forming V_n, where t denotes the number of positive eigenvalues of S_n. Typically, t = d.
Step 5. We construct our new outlyingness function by again employing the Mahalanobis spatial outlyingness function and the spatial trimming step of Step 1, but
now applied to the data Vn instead of Xn . This yields a Robust Mahalanobis Spatial
outlyingness function based on Projections (RMSP),
ORMSP (X i , Xn ), 1 ≤ i ≤ n,
a robust outlyingness ordering of the original data points in Xn via projection pursuit
with finite ∆.
Summary:
Xn (original data: n vectors in Rd )
↓ (via D(Xn ))
X∗n (SICS-transformed data: n vectors in Rd )
↓ (via O(x, F ) and ∆)
An (n projected scaled deviation vectors in Rs , s = 5d)
↓ (via PCA)
Vn (n vectors in Rt , typically with t = d)
↓ (via robust Mahalanobis spatial outlyingness)
ORMSP
Details for Step 1 are provided in Section 2. Following a method of Serfling (2010) for
construction of sample SICS functionals in conjunction with spatial trimming of observations (Mazumder and Serfling, 2013), we introduce an easily computable and robust SICS
functional to be used for the purposes of this paper.
Details for Steps 3 and 4 are provided in Section 3, and those for Step 5 in Section 4,
leading to our affine invariant, robust, and easily computable outlyingness function RMSP.
In Section 5, we compare SUP, MD, and RMSP using artificial bivariate data sets for the sake of visual comparisons, artificial higher-dimensional data sets, and actual higher-dimensional data sets, the Stackloss Data (n = 21, d = 4) and the Air Pollution and Mortality Data (n = 59, d = 13), which have been studied extensively in the literature.
Our findings and conclusions are provided in Section 6. Briefly, SUP is outclassed by
MD and RMSP, MD and RMSP are competitive, and RMSP is especially advantageous in
describing the intermediate outlyingness structure when elliptical contours are not assumed.
We conclude this introduction by mentioning some notable previous work on projection
pursuit outlyingness using only finitely many projections. Despite their various strengths,
however, none of these methods capture all of properties (i)-(iv).
• Pan, Fung, and Fang (2000) use finitely many deterministic directions approximately uniformly scattered and develop a finite ∆ approach calculating a sample quadratic form based on the differences

{O(u'x, u'X_n) − O(u'x, F_{u'X}), u ∈ ∆}.

This imposes elliptical outlyingness contours. Further, since these differences involve the unknown F, a bootstrap step is incorporated. The number of directions is data-driven and the method is not affine invariant.
• Peña and Prieto (2001) introduce an affine invariant method using the supremum and
2d data-driven directions. These are selected using univariate measures of kurtosis over
candidate directions, choosing the d directions with local extremes of high kurtosis and
the d directions with local extremes of low kurtosis. In a very complex algorithm, the
“outliers” are ultimately selected using Mahalanobis distance (thus entailing elliptical
contours).
• Filzmoser, Maronna, and Werner (2008) extend the Peña and Prieto (2001) approach
into an even more elaborate one, adding a principal components step. This achieves
certain improvements in performance for detection of location outliers, especially in
high dimension. However, this gives up affine invariance. See also Maronna, Martin,
and Yohai (2006).
2 Step 1: A Robust SICS Transformation
In general, by weak affine invariance of a functional T(F) is meant that

T(Ax + b, F_{AX+b}) = c T(x, F_X),   x ∈ R^d,

where A_{d×d} is nonsingular, b is any vector in R^d, and c = c(A, b, F_X) is a constant. Here, corresponding to some given finite ∆ = {u_1, . . . , u_s}, we are interested in the functional

ζ(x, ∆, F_X) = ( (u_1'x − ν(F_{u_1'X}))/η(F_{u_1'X}), . . . , (u_s'x − ν(F_{u_s'X}))/η(F_{u_s'X}) )',   x ∈ R^d,
whose components give (signed) scaled deviation outlyingness values for the projections of
a point x onto the lines represented by ∆. It is straightforward to verify by simple counterexamples that ζ(x, ∆, FX) is not weakly affine invariant (nor even orthogonally invariant).
However, we can make it become weakly affine invariant by first applying to the variable x
a strong invariant coordinate system (SICS) functional (Serfling, 2010). Also, as applied to
the sample version ζ(x, ∆, Xn) we also want the SICS transformation to be robust.
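For instance, with d = 2, ∆ = {(1, 0)', (0, 1)'}, and X having independent coordinates of very different scales, a 45° rotation A turns each projection u_j'(AX) into a mixture of the two original coordinates; the two components of ζ then generally change by different factors, so no single constant c = c(A, b, F_X) can exist.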
2.1 Standardization of ζ(x, ∆, F_X) using a SICS functional
Definition 1 (Serfling, 2010). A positive definite matrix-valued functional D(F ) is a strong
invariant coordinate system (SICS) functional if, for Y = A X + b,
D(FY ) = k3 D(FX ) A−1 ,
where Ad×d is nonsingular, b is any vector in Rd , and k3 = k3 (A, b, FX ) is a scalar.
Detailed treatment is found in Tyler, Critchley, Dümbgen and Oja (2009), Serfling (2010),
and Ilmonen, Oja, and Serfling (2012). We now establish that the SICS-standardized vector ζ(D(F_X)x, ∆, F_{D(F_X)X}) is weakly affine invariant.
Theorem 2 Let X have distribution FX on Rd and let X∗ = D(FX)X, where D(F ) is a
SICS functional. Then ζ(x∗ , ∆, FX∗ ) is weakly affine invariant.
Proof of Theorem 2. Suppose X ↦ Y = AX + b, where A is a d × d nonsingular matrix and b is any vector in R^d. Now transform Y ↦ Y* = D(F_Y)Y. Then, using Definition 1, we have Y* = D(F_{AX+b})(AX + b) = k D(F_X)X + k D(F_X)A^{-1}b = k X* + c, where k = k(b, A, F_X) and c = k D(F_X)A^{-1}b. Hence, for 1 ≤ j ≤ s,

(u_j'y* − med(u_j'Y*)) / MAD(u_j'Y*)
  = (u_j'(k x* + c) − med(u_j'(k X* + c))) / MAD(u_j'(k X* + c))
  = sgn(k) (u_j'x* − med(u_j'X*)) / MAD(u_j'X*).

Thus ζ(y*, ∆, F_{Y*}) = sgn(k) ζ(x*, ∆, F_{X*}) and the result follows. □
2.2 A robust SICS transformation via spatial trimming
In general, following Serfling (2010), we may construct a SICS functional as follows. Let Z_N be a subset of X_n of size N obtained through some affine invariant and permutation invariant procedure. Then form d + 1 means Z_1, . . . , Z_{d+1} based, respectively, on consecutive blocks of size m = ⌊N/(d + 1)⌋ from Z_N, and define the matrix

W(X_n) = [(Z_2 − Z_1), . . . , (Z_{d+1} − Z_1)]_{d×d}.

Then the matrix

D(X_n) = W(X_n)^{-1}   (2)

is a sample SICS functional.
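For concreteness, a minimal sketch of this construction follows (Python with NumPy; an illustration, not the authors' code). The index set of the N "inner" observations is taken as given here, for example as produced by the spatial trimming of Section 2.2 below.

import numpy as np

def sics_matrix(X, inner_idx):
    """Sample SICS functional D(X_n) = W(X_n)^{-1} as in (2).

    X : (n, d) data matrix; inner_idx : indices of the N "inner"
    observations Z_N (assumed chosen in an affine invariant way,
    and assumed to yield a nonsingular W).
    """
    Z = X[inner_idx]
    N, d = Z.shape
    m = N // (d + 1)                       # block size
    # d + 1 means of consecutive blocks of Z_N
    means = [Z[k * m:(k + 1) * m].mean(axis=0) for k in range(d + 1)]
    W = np.column_stack([means[k + 1] - means[0] for k in range(d)])
    return np.linalg.inv(W)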
The robustness and computational burden rest on the method of choosing ZN . One
implementation is to let ZN be the set of the observations selected and used in computing
the well-known Fast-MCD scatter matrix with N = αn, where α = 0.5 or 0.75. This
uses all the observations in a permutationally invariant way in selecting ZN and all of the
latter observations in defining W(Xn ). However, due to its combinatorial character, this
approach becomes computationally prohibitive as d increases. Instead we shall develop a
computationally easy robust sample SICS functional based on letting ZN consist of the
“inner” observations obtained in spatial trimming, as follows. We start with the following
definition.
Definition 3 (Serfling, 2010). A positive definite matrix-valued functional M(F) is a transformation-retransformation (TR) functional if

A'M(Y_n)'M(Y_n)A = k_2 M(X_n)'M(X_n)   (3)

for Y = AX + b, and with k_2 = k_2(A, b, X_n) a positive scalar function of A, b, and X_n.
Any inverse square root of a scatter or shape matrix is a TR matrix. These are discussed
in detail in Serfling (2010) and, in particular, it is shown that any SICS functional is a TR
functional.
In particular, we will use a TR functional introduced by Tyler (1987) to achieve certain
favorable theoretical properties in the elliptical model and extended by Dümbgen (1998). It
is easily computed in any dimension, making it a computationally attractive alternative to
Fast-MCD, although not as robust. With respect to a specified location functional θ(Xn ),
the Tyler matrix is defined as C(X_n) = (M_s(X_n)'M_s(X_n))^{-1}, with M_s(X_n) the TR matrix defined as the unique symmetric square root of C(X_n)^{-1} obtained through the M-estimation equation

n^{-1} Σ_{i=1}^n [M_s(X_n)(X_i − θ(X_n)) / ||M_s(X_n)(X_i − θ(X_n))||] [M_s(X_n)(X_i − θ(X_n)) / ||M_s(X_n)(X_i − θ(X_n))||]' = d^{-1} I_d.   (4)
An iterative algorithm using Cholesky factorizations to compute M s (Xn ) quickly in any
practical dimension is given in Tyler (1987). Another solution of (4) is given by the upper
triangular square root M t (Xn ) of C(Xn )−1 and is computed by a similar algorithm. In Tyler
(1987), the quantity θ(X_n) is specified as a known constant. For inference situations when θ(X_n) is not known or specified, a symmetrized version of M_s(X_n) eliminating the need of a location measure is given by Dümbgen (1998): M_{s1}(X_n) = M_s(X_n ⊖ X_n), where X_n ⊖ X_n denotes the set of differences X_i − X_j. Convenient R packages (e.g., ICSNP) are available for computation of these estimators. See Tyler (1987) and Serfling (2010) for detailed discussion. Here we will utilize the "Dümbgen–Tyler" TR matrix M_{s1}, denoted by "DT".
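As an illustration of how such a TR matrix can be obtained, the following sketch computes Tyler's scatter matrix by a simple fixed-point iteration (an illustration only; not the Cholesky-based algorithm of Tyler, 1987, and with the location θ taken as known rather than using the symmetrized Dümbgen version). The TR matrix M_s is then the symmetric inverse square root of the fitted scatter matrix.

import numpy as np

def tyler_scatter(X, theta, n_iter=100, tol=1e-8):
    """Fixed-point iteration for Tyler's M-estimator of scatter.

    X : (n, d) data; theta : assumed-known location (d,).
    Assumes no observation equals theta.  Returns C with trace
    normalized to d; a solution of (4) is then M_s = C^{-1/2}.
    """
    n, d = X.shape
    Z = X - theta
    C = np.eye(d)
    for _ in range(n_iter):
        Cinv = np.linalg.inv(C)
        dist2 = np.einsum('ij,jk,ik->i', Z, Cinv, Z)   # z_i' C^{-1} z_i
        C_new = (d / n) * (Z.T @ (Z / dist2[:, None]))
        C_new *= d / np.trace(C_new)                   # fix the scale
        if np.max(np.abs(C_new - C)) < tol:
            C = C_new
            break
        C = C_new
    return C

def inv_sqrt(C):
    """Symmetric inverse square root, used as the TR matrix M_s."""
    w, V = np.linalg.eigh(C)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T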
We will make use of the TR matrix DT via the Mahalanobis spatial outlyingness function
(Serfling, 2010) and the notion of spatial trimming (Mazumder and Serfling, 2013). Based
on any given TR matrix M(X_n), and based on the spatial sign function (or unit vector function)

S(x) = x/||x||, x ∈ R^d, x ≠ 0;   S(0) = 0,

a corresponding affine invariant "Mahalanobis spatial outlyingness function" is given by

O_MS(x, X_n) = || n^{-1} Σ_{i=1}^n S(M(X_n)(x − X_i)) ||,   x ∈ R^d.   (5)
Without standardization by any M (Xn ), (5) gives the well-known “spatial outlyingness
function” (Chaudhuri, 1996), which is only orthogonally invariant. Standardization by a TR
matrix produces affine invariance.
“Spatial trimming” (possibly depending on n and/or d) consists of trimming away those
“outer” observations satisfying OMS (X i , Xn ) > λ0 , for some specified threshold λ0 . Then
the covariance matrix of the remaining “inner” observations is robust against outliers. For
M (Xn ) given by DT, this is computationally faster than Fast-MCD. We denote by J the
set of indices of the “inner” observations selected by spatial trimming.
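A minimal sketch of (5) and of spatial trimming follows (an illustration; the TR matrix M is passed in, with DT being one possible choice, and the threshold λ_0 is discussed next).

import numpy as np

def spatial_sign(Y):
    """Row-wise spatial sign S(y) = y/||y|| (0 at the origin)."""
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    out = np.zeros_like(Y)
    np.divide(Y, norms, out=out, where=norms > 0)
    return out

def mahalanobis_spatial_outlyingness(x, X, M):
    """O_MS(x, X_n) of (5): norm of the average spatial sign of
    M (x - X_i), i = 1, ..., n."""
    signs = spatial_sign((x - X) @ M.T)
    return np.linalg.norm(signs.mean(axis=0))

def spatial_trim(X, M, lambda0):
    """Indices J of the "inner" observations: O_MS(X_i, X_n) <= lambda0."""
    out = np.array([mahalanobis_spatial_outlyingness(xi, X, M) for xi in X])
    return np.where(out <= lambda0)[0]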
For choice of the threshold λ_0, we follow Mazumder and Serfling (2013) and recommend

λ_0 = d/(d + 2),   (6)
which is justified as follows. The outlyingness function OMS without standardization has
masking breakdown point (MBP) approximately (1 − λ0 )/2 (Dang and Serfling, 2010). This
is the minimal fraction of observations in the sample which, if corrupted, can cause arbitrarily
extreme outliers to become “masked” (misclassified as nonoutliers). One may obtain an MBP
as high as possible (≤ 1/2) by choosing λ0 small enough. However, to avoid overtrimming
and losing “resolution”, a moderate choice is prudent. With standardization, we also need to
take account of the explosion replacement breakdown point RBPexp of the chosen TR matrix
M (Xn ). We will adopt the easily computable DT, which has RBPexp subject to the upper
bound (d + 1)−1 , decreasing with d (see Dümbgen and Tyler, 2005). We balance (1 − λ0 )/2
with (d + 1)−1 . However, to avoid overrobustification in low dimensions, we instead use
(d + 2)−1 , thus solving (1 − λ0 )/2 = (d + 2)−1 , to obtain (6). The increase of λ0 with d is
reasonable, since in higher dimension distributions place greater probability in the tails and
observations are more dispersed, causing outlyingness values to tend to be larger overall.
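For example, the values implied by (6): d = 2 gives λ_0 = 1/2 with target MBP (1 − λ_0)/2 = 1/4, while d = 13 gives λ_0 = 13/15 ≈ 0.87 with target MBP 1/15 ≈ 0.07.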
In summary: our SICS functional D(X_n) is given by the sample version of (2) with Z_N and J determined via spatial trimming using O_MS based on the TR matrix DT and with trimming threshold given by (6).
3 Steps 3 and 4: Application of Projection Pursuit and PCA
Let Xn = {X1 , . . . , Xn } be a random sample from some F on Rd and let D(Xn ) be a sample
SICS functional (as, for example, the one obtained above). We describe in Algorithm A
below the steps for transforming the original data Xn to a set of t-dimensional vectors Vn =
{V1 (Xn ), . . . , Vn (Xn )} by a projection pursuit approach using ui , 1 ≤ i ≤ s, combined with
a SICS preliminary transformation and a PCA dimension reduction step. We also provide a
key property of Vn .
Algorithm A, for Vn
1. Standardize Xi , 1 ≤ i ≤ n, with a given sample SICS functional D(Xn )d×d via X∗i =
D(Xn ) Xi , 1 ≤ i ≤ n, and put X∗n = [X∗1 · · · X∗n ].
2. Choose the number of directions s and select s unit vectors ∆ = {u1 , . . . , us } uniformly
distributed on the unit sphere in Rd but lying on distinct diameters, following the
algorithm of Fang and Wang (1994). From trial and error with examples of various
dimensions, choices such as s = 4d, 5d, or 6d are effective. We recommend: s = 5d .
3. Calculate the projections u_j'X*_i, 1 ≤ i ≤ n, 1 ≤ j ≤ s, and denote by ν(u_j'X*_n) and η(u_j'X*_n) the median and MAD, respectively, of {u_j'X*_1, . . . , u_j'X*_n}, 1 ≤ j ≤ s.
4. Put A_n = [a_1 · · · a_n]_{s×n}, where a_i = (a_n(i, 1), . . . , a_n(i, s))', 1 ≤ i ≤ n, with

a_n(i, j) = (u_j'X*_i − ν(u_j'X*_n)) / η(u_j'X*_n),   1 ≤ j ≤ s, 1 ≤ i ≤ n.

The initial d × n data matrix X_n of n d-vectors now has been converted to a new data matrix A_n of dimension s × n, i.e., consisting of n s-vectors a_1, . . . , a_n, with a_i associated with the original data point X_i, 1 ≤ i ≤ n.
5. Let Sn denote the sample covariance matrix of those columns ai of An with indices i
in the set J. Calculate the eigenvalues λ1 ≥ . . . ≥ λs of Sn and let P = [p1 · · · ps ]s×s
denote the orthogonal matrix containing the corresponding eigenvectors p1, . . . , ps as
column vectors.
6. PCA reduction. Define the s × 1 vectors Ṽ_i = P'a_i, 1 ≤ i ≤ n. Let t be the number of eigenvalues greater than 10^{-6} (typically, t = d). Let the t-vector V_i contain the first t components of the vector Ṽ_i, 1 ≤ i ≤ n, and put

V_n = [V_1 · · · V_n]_{t×n}.

This completes reduction of the original d × n data matrix X_n to a new t × n data matrix V_n. It is readily checked that the covariance matrix of the vectors V_i, 1 ≤ i ≤ n, is Λ_{t×t} = diag(λ_1, . . . , λ_t).
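To summarize Algorithm A in computational form, the following sketch (an illustration, Python with NumPy) carries out the standardization, projection, scaled-deviation, and PCA steps. The directions are generated here by normalizing Gaussian draws as a simple stand-in for the number-theoretic construction of Fang and Wang (1994), and the SICS matrix D and inner index set J are assumed supplied, e.g., by the procedures of Section 2.

import numpy as np

def algorithm_A(X, D, J, s=None, eig_tol=1e-6, rng=None):
    """Sketch of Algorithm A: SICS standardization, projection pursuit,
    scaled deviations, and PCA reduction.  X: (n, d); D: SICS matrix;
    J: indices of inner observations; returns V (n, t) and the directions.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    s = 5 * d if s is None else s
    # Step 2: crude stand-in for quasi-uniform directions on the unit sphere
    U = rng.standard_normal((s, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # Step 1: SICS standardization, rows are X*_i = D X_i
    Xstar = X @ D.T
    # Steps 3-4: projected scaled deviations a_n(i, j)
    P = Xstar @ U.T                                  # (n, s), entry (i, j) = u_j' X*_i
    med = np.median(P, axis=0)
    mad = np.median(np.abs(P - med), axis=0)
    A = (P - med) / mad                              # rows a_i, i = 1, ..., n
    # Step 5: covariance from the inner columns only
    S = np.cov(A[J], rowvar=False)
    lam, Pmat = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]
    lam, Pmat = lam[order], Pmat[:, order]
    # Step 6: PCA reduction to the t leading components
    t = int(np.sum(lam > eig_tol))
    V = A @ Pmat[:, :t]                              # rows V_i in R^t
    return V, U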
We note that the above algorithm is affine invariant, in the sense that if we transform
Xi → Yi = AXi + b, 1 ≤ i ≤ n, where A is any d × d dimensional nonsingular matrix
and b is any vector in Rd , then the matrix of vectors Vi , 1 ≤ i ≤ n, will remain the same
up to a global sign change. This is stated formally as follows (the proof is straightforward
and omitted).
Lemma 4 Let X_1, . . . , X_n be a random sample from a d-dimensional distribution function F. Suppose Y_i = A X_i + b, where A is any d × d nonsingular matrix and
b is any vector in Rd . Put Yn = [Y1 · · · Yn ]. Further, assume that Vn (Xn ) is obtained
using Algorithm A starting with the data matrix Xn , and that Vn (Yn ) is obtained using
Algorithm A starting with the data matrix Yn . Then
Vn (Yn ) = sgn(k) Vn (Xn ),
(7)
where k = k(A, b, Xn ).
4 Robust Mahalanobis Spatial Outlyingness using V_n
In Algorithm A, the data vector Xi becomes replaced by Vi , 1 ≤ i ≤ n, or more generally
points x ∈ Rd are mapped onto points v ∈ Rt . We formulate our new robust affine invariant
outlyingness function ORMSP (x, Xn ) (denoted by RMSP) as the robust Mahalanobis spatial
outlyingness function on the vectors Vi , 1 ≤ i ≤ n. Thus ORMSP (x, Xn ) = OMS (v(x), Vn ),
where v(x) is the vector associated with x via Algorithm A. As before, we use DT (on Vn )
for the standardization defining OMS . After transforming the given data Xn = [X1 · · · Xn ]d×n
to Vn = [V1 · · · Vn ]t×n , using inner observations indexed by J, we follow the steps below to
form RMSP.
Algorithm B, for RMSP
1. Using the Mahalanobis spatial outlyingness function (5) on V_n (instead of X_n), and with TR matrix M(V_n) = DT(V_n), i.e., using

O_MS(v, V_n) = || n^{-1} Σ_{i=1}^n S(M(V_n)(v − V_i)) ||,   v ∈ R^t,   (8)

carry out the spatial trimming method of Section 2.2 based on trimming threshold t/(t + 2). Let J* denote the set of indices of the "inner" observations so obtained, and let V*_n denote the inner observations.
2. Now robustify DT(V_n) by computing it just on V*_n: DT(V*_n).
3. Then RMSP is defined according to (8), but with the robustified DT for standardization and with the averaging taken over just the inner observations. That is, denoting by K the cardinality of J*, and with M(V*_n) = DT(V*_n), we form

O_RMSP(x, X_n) = || K^{-1} Σ_{j∈J*} S(M(V*_n)(v(x) − V_j)) ||,   x ∈ R^d.   (9)

In particular, for the data points X_i, 1 ≤ i ≤ n, the corresponding RMSP outlyingness values are given by

O_RMSP(X_i, X_n) = || K^{-1} Σ_{j∈J*} S(M(V*_n)(V_i − V_j)) ||,   1 ≤ i ≤ n.   (10)
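A minimal computational sketch of Algorithm B follows (an illustration, reusing the hypothetical helper functions sketched in Sections 2 and 3: algorithm_A, tyler_scatter, inv_sqrt, mahalanobis_spatial_outlyingness, and spatial_trim). Tyler's matrix with the coordinatewise median as location is used as a simple stand-in for the symmetrized DT matrix.

import numpy as np

def rmsp(X, D, J, lambda_fun=lambda t: t / (t + 2)):
    """Sketch of RMSP: Algorithm B applied to the output of Algorithm A."""
    V, _ = algorithm_A(X, D, J)
    n, t = V.shape
    # Step 1: spatial trimming of V_n with threshold t/(t+2)
    M = inv_sqrt(tyler_scatter(V, np.median(V, axis=0)))
    J_star = spatial_trim(V, M, lambda_fun(t))
    # Step 2: robustified TR matrix computed on the inner observations only
    V_in = V[J_star]
    M_star = inv_sqrt(tyler_scatter(V_in, np.median(V_in, axis=0)))
    # Step 3: RMSP values (10) for the original data points
    return np.array(
        [mahalanobis_spatial_outlyingness(v, V_in, M_star) for v in V]
    )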
Lemma 5 ORMSP (x, Xn ) is affine invariant, in the sense that if we transform Xi → Yi =
AXi + b, where Ad×d is nonsingular and b is any vector in Rd , then
ORMSP (y, Yn ) = ORMSP (x, Xn),
with y = Ax + b.
(The proof is immediate and omitted.)
5 Comparison of RMSP, SUP, and MD
5.1 Visual illustration with artificial bivariate data
Using artificially created bivariate data sets, we are able to provide visual illustrations of
important differences among RMSP, SUP, and MD. Here MD denotes robust Mahalanobis
distance outlyingness with the Fast-MCD location and scatter measures based on the inner
50% observations.
Besides identifying the more extreme outliers, an outlyingness function also has the role
of providing a structural description of the data set. In effect, this provides a quantile-based
description. Our plots exhibit 50%, 75%, and 90% outlyingness contours (based on the given
outlyingness function and enclosing 50%, 75%, and 90% of the observations, respectively).
Figure 1 shows a triangularly shaped data set with and without outliers, Figure 2 a
bivariate Pareto data with and without outliers, and Figure 3 a bivariate Normal data with
and without outliers. Based on these displays, we make the following comments.
1. SUP is dominated by MD and RMSP. All three of SUP, MD, and RMSP have similar contours: strictly ellipsoidal for MD and approximately so for the others. However, the contours of SUP are not as smooth as those of the others and its computational burden is considerably greater.
2. The robustness properties of SUP, MD, and RMSP are comparable for moderate contamination. For moderate contamination (somewhat less than 50%), all three methods
perform equally well in constructing contours that account for the outliers without
becoming distorted by them. For each of SUP, MD, and RMSP, the outlier cluster becomes placed on or beyond the 90% contour and well outside the 75% contour,
rather than, for example, pulling the 75% contour toward itself. On the other hand, if
protection against very high contamination is needed, SUP and MD can be adjusted
to have nearly 50% BP, while RMSP cannot.
3. RMSP is overall superior for moderate contamination. With less computational burden
and without imposing elliptical contours, RMSP is as robust as MD and SUP.
5.2 Numerical experiments with artificial data
A small experiment comparing SUP (with 100,000 directions), MD (with Fast-MCD), and
RMSP was carried out on a typical PC for multivariate standard Normal data with sample
size n = 100 and dimensions d = 2, 5, 10, and 20 (only d = 2 for SUP).
1. First, each of the 15 most extreme observations according to Euclidean distance was
replaced by itself times the factor 5. All three methods correctly selected the 15 outliers
as distinctly more outlying than the other 85 observations (although not always with
the same ranks).
Times in seconds:

d     SUP       MD      RMSP
2     330.69    0.48    2.55
5     —         2.03    3.65
10    —         3.97    4.60
20    —         6.73    4.94
Clearly, SUP is prohibitively computational, compared with MD and RMSP. The
well-established MD with Fast-MCD exhibits high computational efficiency for very
low dimensions. The above times are based on setting the tuning parameter “nsamp”
in MD(MCD) using rrcov to make its breakdown point match that of RMSP, namely
(d + 2)−1 . Thus we used nsamp = 6, 10, 11, and 12 for d = 2, 5, 10, and 20 (versus the
default nsamp = 500, which results in slightly higher computation times). However,
the advantage of MD is quickly lost as d increases, and the computational appeal of
RMSP becomes evident.
2. As a more challenging variant carried out just for d = 2, each of the 15 most extreme
observations according to Euclidean distance was replaced by itself plus the vector
(10, 0). Both RMSP and MD again correctly ranked the 15 outliers as more extreme
in outlyingness than the other 85 (although not always in the same order).
From all considerations, RMSP is the method of choice for moderate contamination. A
competitor is MD when elliptical contours are acceptable, while SUP may be dropped from
consideration. This study raises the question of just how much we do or do not like the
ellipsoidal contours of MD. A detailed evaluation of this question is worthwhile but beyond
the scope of the present paper.
5.3 Actual data
For two well-studied higher-dimensional data sets, we examine the performance of MD and
RMSP. Here, for compatibility with Becker and Gather (1999), MD denotes the robust
Mahalanobis distance with S-estimates for location and covariance based on Tukey’s biweight
(BW) function. (Separate investigation shows little difference between this MD and that
based on MCD.)
Stackloss Data (d = 4)
The stackloss data set of Brownlee (1965) has been much studied as a test of outlier detection
methods. It consists of n = 21 observations in dimension d = 4. See Rousseeuw and Leroy
(1987) and Becker and Gather (1999) for the data set, for references to many studies, and
for general discussion. All robust methods cited in the literature, including MD, rank
observations 1, 3, 4, and 21 as the top 4 in outlyingness, with little difference in order. Our
RMSP agrees with these rankings and also with MD on observation 2 as 5th, 13 as 6th, and
17 as 7th. Thus RMSP corroborates existing approaches while being more computationally
attractive. Figure 4 shows that the main difference between MD and RMSP with this data
set is the ranking of moderately outlying observations.
Pollution and Mortality Data (d = 13)
Becker and Gather (1999) study in detail a 13-dimensional data set available at Data and
Story Library (http://lib.stat.cmu.edu/DASL/) which provides information on social and
economic conditions, climate, air pollution, and mortality for 60 Standard Metropolitan
Statistical Areas (a standard U. S. Census Bureau designation of the region around a major
city) in the United States. They omit a case with incomplete data and rank the remaining
n = 59 cases in dimension d = 13 by MD. Here we compare RMSP and MD. Both agree
on the extreme observations indexed 28, 47, 46, 48 and ranked in this order. Also, they
agree on 11 cases as among the next 12 cases, although with somewhat differing ranks. The
exceptions are that MD ranks observation 36 as 14th and observation 38 as 22nd, while
RMSP ranks these as 17th and 16th, respectively. The difference in ranks 16 versus 22 for
observation 38 raises the question of whether observation 36 (New Orleans) in comparison
with observation 38 (Philadelphia) should rank far apart, 14th versus 22nd as per MD, or
closely, 17th versus 16th as per RMSP.
Coordinatewise dotplots of all 13 variables for these cases reveal that these points overall
are not very outlying, except that 36 is moderate to extreme in outlyingness for mortality
and moderate for SO2 pollution, and that 38 is moderate to strong in outlyingness for
population size, moderate for population density, and nearly moderate for SO2 pollution.
On this basis we regard 36 and 38 as comparable cases both moderate in outlyingness,
corroborating the opinion of RMSP over that of MD.
This illustrates the advantage of not imposing elliptical contours: the levels of moderate
outlyingness can be delineated better. Figure 4 shows that the main difference between MD
and RMSP with this data set is the ranking of moderately outlying observations.
6 Conclusions and Recommendations
Based on our theoretical development and empirical analysis, we make the following basic conclusions about SUP, MD, and RMSP.
1. RMSP has all four desired properties: (i) affine invariance, (ii) robustness, (iii) computational ease in all dimensions, and (iv) no imposition of elliptical contours.
2. SUP is outclassed by RMSP, which is also a projection pursuit approach but one
much more attractive computationally and – among other finite ∆ approaches – easy
to understand.
3. Robust versions of MD and RMSP are competitive, but, by not imposing elliptical
contours, RMSP is of particular advantage in describing intermediate outlyingness
structure.
4. The spatial trimming technique used here with Vn to develop a new projection pursuit
approach is used just on Xn in Mazumder and Serfling (2013) to produce a new “robust
spatial outlyingness”. It provides another alternative to MD and performs similarly to
RMSP, although the latter seems slightly better due to acquiring projection pursuit
information. A detailed comparison in some depth would be of interest but is beyond
the scope of the present paper.
5. In practice, one might use both MD (when computationally feasible) and RMSP. When they agree, one can be confident. When they disagree, the relevant observations can be investigated.
Acknowledgements
The authors gratefully acknowledge very helpful, insightful comments from reviewers. Useful
input from G. L. Thompson is also greatly appreciated. Support under National Science
Foundation Grants DMS-0805786 and DMS-1106691 and National Security Agency Grant
H98230-08-1-0106 is sincerely acknowledged.
References
[1] Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier
identification rules. Journal of the American Statistical Association 94 947–955.
[2] Brownlee, K. A. (1965). Statistical Theory and Methodology in Science and Engineering,
2nd edition. John Wiley & Sons, New York.
[3] Chaudhuri, P. (1996). On a geometric notion of quantiles for multivariate data. Journal
of the American Statistical Association 91 862–872.
[4] Dang, X. and Serfling, R. (2010). Nonparametric depth-based multivariate outlier identifiers, and masking robustness properties. Journal of Statistical Planning and Inference
140 198–213.
[5] Dümbgen, L. (1998). On Tyler’s M-functional of scatter in high dimension. Annals of
the Institute of Statistical Mathematics 50 471–491.
[6] Dümbgen, L. and Tyler, D. E. (2005). On the breakdown properties of some multivariate
M-functionals. Scandinavian Journal of Statistics 32 247–264.
[7] Fang, K.T. and Wang, Y. (1994). Number Theoretic Methods in Statistics. Chapman
and Hall, London.
[8] Filzmoser, P., Maronna, R., and Werner, M. (2008). Outlier identification in high
dimensions. Computational Statistics & Data Analysis 52 1694–1711.
[9] Friedman, J. H. and Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers C-23 881–889.
[10] Ilmonen, P., Oja, H., and Serfling, R. (2012). On invariant coordinate system (ICS)
functionals. International Statistical Review 80 93–110.
[11] Kruskal, J. B. (1969). Toward a practical method which helps uncover the structure of a set of multivariate observations by finding the linear transformation which optimizes a new 'index of condensation'. In Statistical Computation (R. C. Milton and J. A. Nelder, eds.), Academic Press, New York.
[12] Kruskal, J. B. (1972). Linear transformation of multivariate data to reveal clustering. In Multidimensional Scaling: Theory and Applications in the Behavioral Sciences, I, Theory. Seminar Press, New York and London.
[13] Liu, R. Y. (1992). Data depth and multivariate rank tests. In L1 -Statistics and Related
Methods (Y. Dodge, ed.), pp. 279–294, North-Holland, Amsterdam.
[14] Maronna, R. A., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and
Methods. Wiley, Chichester, England.
[15] Mazumder, S. and Serfling, R. (2013). A robust sample spatial outlyingness function.
Journal of Statistical Planning and Inference 143 144–159.
[16] Pan, J.-X., Fung, W.-K., and Fang, K.-T. (2000). Multiple outlier detection in multivariate data using projection pursuit techniques. Journal of Statistical Planning and
Inference 83 153–167.
[17] Peña, D. and Prieto, F. J. (2001). Robust covariance matrix estimation and multivariate
outlier rejection. Technometrics 43 286–310.
[18] Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regression and Outlier Detection.
John Wiley & Sons, New York.
[19] Rousseeuw, P. and Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics 41 212–223.
[20] Serfling, R. (2004). Nonparametric multivariate descriptive measures based on spatial
quantiles. Journal of Statistical Planning and Inference 123 259–278.
[21] Serfling, R. (2010). Equivariance and invariance properties of multivariate quantile and
related functions, and the role of standardization. Journal of Nonparametric Statistics
22 915–936.
[22] Switzer, P. (1970). Numerical Classification. In Geostatistics. Plenum, New York.
[23] Switzer, P. and Wright, R. M. (1971). Numerical classification applied to certain Jamaican Eocene nummulitids. Mathematical Geology 3 297–311.
[24] Tyler, D. E. (1987). A distribution-free M-estimator of multivariate scatter. Annals of
Statistics 15 234–251.
[25] Tyler, D. E., Critchley, F., Dümbgen, L. and Oja, H. (2009). Invariant co-ordinate
selection. Journal of the Royal Statistical Society, Series B 71 127.
[26] Zuo, Y. and Serfling, R. (2000b). General notions of statistical depth function. Annals
of Statistics 28 461–482.
[27] Zuo, Y. (2003). Projection-based depth functions and associated medians. Annals of
Statistics 31 1460–1490.
Figure 1: Triangular data set with 50%, 75%, and 90% outlyingness contours for SUP
(upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme
outliers including a cluster (right).
Figure 2: Bivariate Pareto data set with 50%, 75%, and 90% outlyingness contours for
SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme
outliers including a cluster (right).
Figure 3: Bivariate Normal data set with 50%, 75%, and 90% outlyingness contours for SUP (upper), MD (middle), and RMSP (lower), without outliers (left) and with extreme replacement outliers including a cluster (right).
Figure 4: RMSP versus MD: Stackloss data (left) and Pollution data (right).