A Convex Hull Peeling Depth Approach to Nonparametric Massive

A Convex Hull Peeling Depth Approach to
Nonparametric Massive Multivariate Data
Analysis with Applications
Hyunsook Lee
.
[email protected]
Department of Statistics
The Pennsylvania State University
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 1
Outlines
I
Convex Hull Peeling (CHP) and Multivariate Data Analysis
Definitions on CHP
Data Depth (Ordering Multivariate Data)
Quantiles and Density Estimation
I
Color Magnitude (CM) Diagram and Sloan Digital Sky Survey
I
Nonparametric Descriptive Statistics with CHP
Multivariate Median
Skewness and Kurtosis of a Multivariate Distribution
I
Outlier Detection with CHP
Level α ; Shape Distortion; Balloon Plot
I
Concluding Remarks
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 2
Definitions
A set C ⊆ Rd is convex if for every two points x, y ∈ C the
whole segment xy is also contained in C.
Convex Set
The convex hull of a set of points X in Rd is denoted by
CH(X), is the intersection of all convex sets in Rd containing X. In
Convex Hull
algorithms, a convex hull indicates points of a shape invariant minimal
subset of CH(X) (vertices, extreme points), connecting these points
2
1
0
−2
−1
y
0
−1
−2
y
1
2
produces a wrap of CH(X).
−2
−1
0
1
2
−2
−1
0
1
2
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 3
x
x
Convex Hull Peeling
1
0
−2
−1
y
0
−1
−2
y
1
2
After
2
Before
−2
−1
0
x
1
2
−2
−1
0
1
2
x
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 4
Convex Hull Peeling Depth (CHPD)
[CHPD:] For a point x ∈ Rd and the data set X = {X1 , ..., XN −1 }, let
C1 = CH{x, X} and denote a set of its vertices V1 . We can get
Cj = Cj−1 \Vj−1 through CHP until x ∈ Vj (j = 2, ...). Then,
CHPD(x) =
is 0.
](∪k
i=1 Vi )
for
N
k s.t. k = minj {j : x ∈ Vj } ; otherwise CHPD
I
Tukey (1974): Locating data center (median) by the Convex Hull Peeling
Process.
I
Barnett (1976): Ordering based on Depth
I
p̂th quantiles are 1 − p̂th CHPDs.
I
Hyper-polygons of 1 − p̂th depth obtainable from any dimensional data.
I
QHULL(Barber et. al., 1996) works for general dimensions (http://qhull.org).
I
Why CHPD...
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 5
Challenges in
Nonparametric Multivariate Analysis
How to Order Multivariate Data?
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 6
Challenges in
Nonparametric Multivariate Analysis
How to Order Multivariate Data?
Ordering Multivariate Data → Data Depth
I
Mahalanobis Depth : Mahalanobis (1936)
I
Convex Hull Peeling Depth: Barnett (1976)
I
Half Space Depth: Tukey (1975)
I
Simplical Depth : Liu (1990)
I
Oja Depth : Oja (1983)
I
Majority Depth : Singh (1991)
I
Ordering is not uniformly defined
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 6
Statistical Data Depth
(Zuo and Serfling, 2000)
(P1) (Affine invariance) D(Ax + b; FAX+b ) = D(x; FX ) for all X (A
nonsingular matrix) holds for any random vector X in Rd , any d × d
nonsingular matrix A, and any d-vector b;
(P2) (Maximality at center) D(θ; F ) = supx∈Rd D(x; F ) holds for any
F ∈ F having center θ;
(P3) (Monotonicity) for any F ∈ F having deepest point θ,
D(x; F ) ≤ D(θ + α(x − θ); F ) holds for α ∈ [0, 1]; and
(P4) D(x; F ) → 0 as ||x|| → ∞, for each F ∈ F.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 7
Convex Hull Peeling Depth
I
affine invariance
I
maximality at center
I
monotonicity relative to deepest point
I
vanishing at infinity
CHPD has these properties and points of smallest depth are possible
outliers
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 8
Quantile Estimation
I
I
I
Median: A point(s) left after peeling
(will show robustness of this estimator later)
pth Quantile: Level set whose central region contains ∼ 100p% data
(will define the level set and the central region later)
No Closed Form; Empirical Process
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 9
Empirical Density Estimation
0.15
0.00
−4
−4 −3 −2 −1
−4
−2
0
2
4
3
12
0
−1
−2
−3
0
1
2
3
y
0.05
0.10
density
0
−2
y
2
0.20
4
Density Estimation with CHPD on Bivariate Normal Data (McDermott, 2003)
100000 Bivariate Normal Sample
Quantiles={0.99,0.95,0.90,0.80,...0.20,0.10,0.05,0.01}
4
x
x
I
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 10
Lessons and Further Studies
I
Sample from a convex distribution (no doughnut shape)
I
Works on Massive data
−→ Sequential Method
I
Without previous knowledge, no model or prior is known to start an
analysis. Exploratory data analysis for a large database
I
Nonparametric and non-distance based approach
I
Where CHP can be applied and how?
−→ Multi-color diagram from astronomy, where a plethora of free
data archives is available.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 11
Color Magnitude diagram
Two dimensional Color-Color diagram or
Celebrated Hertzsprung-Russell diagram (switch)
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12
Color Magnitude diagram
Two dimensional Color-Color diagram or
Celebrated Hertzsprung-Russell diagram (switch)
What if we can see beyond 2 dimensions without bias (projection)
Then, 3 or higher dimensional color diagrams might have popularity.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12
Color Magnitude diagram
Two dimensional Color-Color diagram or
Celebrated Hertzsprung-Russell diagram (switch)
What if we can see beyond 2 dimensions without bias (projection)
Then, 3 or higher dimensional color diagrams might have popularity.
CHP may assist analyzing multi-color diagrams.
Need a suitable data set with colors.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 12
Sloan Digital Sky Survey: SDSS
Commissioned 2000, now Data Release 5 is available.
5 bands; 4 variables (u-g, g-r, r-i, i-z)
I
Studies on analyzing astronomical massive data received spotlights
recently. http://www.sdss.org
I
July, 2005: Data Release Four
6670 square degrees, 180 million objects
Available from http://www.sdss.org/dr4
From SpecPhotoAll with SQL:
I
Attributes of photometric data are color indices, u,b,g,i,z along with
coordinates.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 13
SQL for SDSS
select ra, dec, z, psfMag_u, psfMag_g, psfMag_r,
psfMag_i, psfMag_z
from SpecPhotoAll
where specclass= 2
I
Note — 2: galaxies, 3: QSO, 4: HighZ QSO
I
Galaxies: 499043
I
Quasars: 70204
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 14
Multivariate Descriptive Statistics
I
CHP Median
I
CHP Skewness
I
CHP Kurtosis
with bivariate simulated data and SDSS DR4
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 15
Convex Hull Peeling Median (CHPM)
Multivariate Median: the inner most point among data
→ Survey of Multivariate Median (Small, 1990)
CHPM: recursive peeling leads to the inner most point(s). The average
of these largest depth points is the median of a data set.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 16
Convex Hull Peeling Median (CHPM)
Multivariate Median: the inner most point among data
→ Survey of Multivariate Median (Small, 1990)
CHPM: recursive peeling leads to the inner most point(s). The average
of these largest depth points is the median of a data set.
Simulations: Sample from standard bivariate normal distribution
n
mean
median
CHPM
104
(0.001338, -0.02232)
(-0.005305, -0.01643)
(0.000918, -0.010589)
106
(0.000072, 0.000114)
(0.001185, -0.000717)
(0.002455, -0.000456)
Sequential CHPM →
(0.004741, -0.004111)
Setting for the sequential method: m=10000 and d=0.05
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 16
Application: Median
Quasars
u-g
g-r
r-i
i-z
Mean
0.4619
0.2484
0.1649
0.1008
Median
0.2520
0.1750
0.1520
0.0770
CHPM
0.2530
0.1640
0.1913
0.0700
Galaxies
u-g
g-r
r-i
i-z
Mean
1.622
0.9211
0.4226
0.3439
Median
1.680
0.8930
0.4200
0.3540
CHPM
1.790
0.957
0.424
0.367
Seq. CHPM
1.772
0.950
0.4228
0.363
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 17
Robustness of Convex Hull Peeling Median
Breakdown point of a convex hull peeling median goes to zero as
n → ∞ (Donoho, 1982). Outliers are necessarily located at infinity.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 18
Robustness of Convex Hull Peeling Median
Breakdown point of a convex hull peeling median goes to zero as
n → ∞ (Donoho, 1982). Outliers are necessarily located at infinity.
Empirical mean square error (EMSE) and Relative Efficiency (RE):
Model:(1 − )N ((0, 0), I) + N (·, 4I)
n = 5000, m = 500, Tj =(CHPM, Mean)
Pm
1
EM SE = m i=1 ||Tj − µ||2
N ((5, 5)t , 4I)
N ((10, 10)t , 4I)
CHPM
Mean
RE
CHPM
Mean
RE
0
0.002178
0.000417
0.191689
0.002178
0.000417
0.191689
0.005
0.0028521
0.001682
0.589961
0.002891
0.005444
1.88291
0.05
0.016842
0.125522
7.45262
0.017824
0.500610
28.08597
0.2
0.139215
2.00109
14.37612
0.1435910
8.0017
55.7264
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 18
Generalized Quantile Process
EinMahl and Mason (1992)
Un (t) = inf {λ(A) : Pn (A) ≥ t, A ∈ A}, 0 < t < 1.
I
I
Central Region:
RCH (t) = {x ∈ Rd : CHP D(x) ≥ t}
Level Set:
BCH (t)
= ∂RCH (t)
= {x ∈ Rd : CHP D(x) = t}
I
Volume Functional:
VCH (t) = V olume(RCH (t))
−→ One dimensional mapping.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 19
8
y
0
−4
2
4
6
4
2
0
−2
y
−4
−2
0
2
4
−4
−2
0
4
x
−200
0 100
y
0
−3 −2 −1
y
1
2
300
3
x
2
−3
−2
−1
0
x
1
2
3
−50
0
50
100
x
→ not equi-probability contours, assume smooth convex distributions
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 20
Skewness Measure
Let xj,i be the ith vertex in a level set BCH,j comprised by the j th
peeling process. A measure of skewness:
Rj =
maxi ||xj,i − CHP M || − mini ||xj,i − CHP M ||
mini ||xj,i − CHP M ||
Not only a sequence of Rj visualizes but also quantizes the skewness
along depths.
Denominator for the regularization → affine invariant Rj
symmetric:
skewed:
flat Rj along convex hull peels
fluctuating Rj
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 21
Simulation: Skewness Measure
N(0, Σ)
non normal
2
1
−3
−2
−2
−2
−1
0
N(0, 1)
0
0
2
2
3
N(0, I)
−3 −2 −1
0
1
2
3
−4
−2
0
2
5
10
15
20
25
30
3
0.0
2
1.0
0.5
1.5
Rj
4
2.0
Rj
1.5
1.0
Rj
2.0
5
2.5
2.5
6
χ210
0
20
40
60
80
jthconvex hull level set
2000
0
20
40
60
80
jthconvex hull level set
0
20
40
60
80
jthconvex hull level set
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 22
8
6
4
2
Rj
10
12
14
Application:Skewness Measure (Quasars)
0
20
40
60
80
jthconvex hull level set
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 23
2
4
6
Rj
8
10
12
Application:Skewness Measure (Galaxies)
0
50
100
150
200
jthconvex hull level set
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 24
Kurtosis Measure
Quantile (Depth) based Kurtosis:
VCH ( 21 − r2 ) + VCH ( 12 + 2r ) − 2VCH ( 21 )
KCH (r) =
VCH ( 21 − 2r ) − VCH ( 12 + 2r )
Tailweight:
t(r, s) =
VCH (r)
VCH (s)
for 0 < s < r ≤ 1. Here,
VCH (r) indicates the volume functional at depth r.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 25
1.0
Simulation: Kurtosis Measure (Tailweight)
0.6
0.4
0.2
0.0
t(r, rmin)
0.8
uniform
normal
t10
cauchy
1.0
0.8
0.6
0.4
0.2
0.0
r : depth
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 26
0.0
0.2
0.4
v(r, rmin)
0.6
0.8
1.0
Application: Kurtosis Measure (Quasars)
1.0
0.8
0.6
0.4
0.2
0.0
depth
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 27
Multivariate Outlier Detection
I
I
What are Outliers?
Detecting Algorithms
Level α
Shape Distortion
Balloon Plot
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 28
What are Outliers?
Outliers are...
I
Cumbersome Observations
I
Lead to New Scientific Discoveries
I
Improve Models (Robust Statistics)
I
...
I
No Clear Objectives but Come Along Often
CHP: Experience and relative Robustness support the Idea of Outlier
Detection.
⇒ We need a clear definition on outliers; especially, outliers of the 21st
century. And outlier detecting methods.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 29
Outliers are observations....
I
Huber (1972): unlikely to belong to the main population.
I
Barnett and Leroy (1994): appear inconsistent with the remainder.
I
Hawkins (1980): deviated so much to arouse suspicion.
I
Beckman and Cook (1983): surprising and discrepant to the
investigator.
Discordant Observations or Contaminants
I
Rohlf (1975): somewhat isolated from the main cloud of points.
Yet, somewhat VAGUE!
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 30
Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ...
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31
Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ...
Multivariate: Mostly bivariate applications
I
Generalized Gap Test (Rolhf, 1975)
I
Bivariate Box Plot (Zani et. al, 1999)
I
Sunburst Plot (Liu et. al., 1999)
I
Bag plot (Miller et. al., 2003)
and Mahalanobis distance D(x) = (x − µ̂)Σ̂−1 (x − µ̂).
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31
Some Outlier Detection Methods
Univariate: Box-and-Whisker plot, Order statistics, ...
Multivariate: Mostly bivariate applications
I
Generalized Gap Test (Rolhf, 1975)
I
Bivariate Box Plot (Zani et. al, 1999)
I
Sunburst Plot (Liu et. al., 1999)
I
Bag plot (Miller et. al., 2003)
and Mahalanobis distance D(x) = (x − µ̂)Σ̂−1 (x − µ̂).
Difficulties of multivariate analysis arise from the complexity of ordering
multivariate data.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 31
Quantile Based Outlier Detection
bivariate t5 with ρ=−0.5
−20
−4
−10
0
y
0
−2
−4
−2
0
2
4
−20
−10
0
0.20
0.15
x
1
2
3
4
0.00
0
3
12
−10
−2
−3
y
0.05
0.10
density
0.15
density
0.10
0.05
0.00
−4 −3 −2 −1
20
x
0.20
x
10
−6
−4
−2
0
x
2
4
6
2
−20
−6−4
46
y
y
2
10
4
20
bivariate standard normal
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 32
Contour Shape Changes
Outliers Added
−2
−2
0
y
0
y
2
2
4
Bivariate Normal Sample
−3
−2
−1
0
1
2
3
−2
0
2
4
6
2
4
6
x
−2
−2
0
y
0
y
2
2
4
x
−3
−2
−1
0
1
2
3
−2
0
x
50
40
30
Volj
10
20
GAP
0
0
10
20
Volj
30
40
50
x
0
20
40
60
jthconvex hull level set
80
0
20
40
60
80
jthconvex hull level set
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 33
Balloon Plot
Outliers Added
−2
−2
0
y
0
y
2
2
4
Bivariate Normal Sample
−3 −2 −1
0
1
2
3
−2
0
2
4
6
x
x
A Balloon Plot is obtained
by blowing .5th CHPD polyhedron
by 1.5
times (lengthwise). Let V.5 be a set of vertices of .5th CHPD hull. The
balloon for outlier detection is
B1.5 = {yi : yi = xi + 1.5(xi − CHP M ), xi ∈ V.5 }.
In other words, blow the balloon of IQR 1.5 times larger.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 34
3
0
0
1
2
Vol0.25
j
200
100
Volj
300
4
400
Outliers in Quasar Population
0
20
40
60
jthconvex hull level set
80
0
20
40
60
80
jthconvex hull level set
Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (474.134, 14.442,
4.353)
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 35
0
0
2
4
Vol0.25
j
2000
1000
Volj
3000
6
4000
8
5000
Outliers in Galaxy Population
0
50
100
j (peel)
150
200
0
50
100
150
200
j(peel)
Volumes of 1st CH, .01 Depth CH, .05 Depth CH: (4919.492, 4.310,
1.075)
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 36
Discussion on CHP
Convex Hull Peeling is..
I
a robust location estimator.
I
a tool for descriptive statistics.
Skewness and Kurtosis measure.
I
a reasonable approach for detecting multivariate outliers.
I
a starter for clustering.
⇒Our methods help to characterize multivariate distributions and
identify outlier candidates from multivariate massive data; therefore,
the results initiate scientists to study further with less bias.
CHP as Exploratory Data Analysis and Data Mining Tools.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 37
Concluding Remarks
Drawbacks of CHPD
I
Limited to moderate dimension data.
I
CHPD estimates depths inward not outward.
I
Convexity of a data set.
I
No population/theoretical counterpart.
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 38
Concluding Remarks
Drawbacks of CHPD
I
Limited to moderate dimension data.
I
CHPD estimates depths inward not outward.
I
Convexity of a data set.
I
No population/theoretical counterpart.
No assumption on data distribution, Non-distance
based, Affine invariant, Applicable to streaming data,
Detecting Outliers, Providing Multivariate Descriptive
Statistics, Exploratory data analysis
Hyunsook Lee, Department of Statistics, Penn State Univ – p. 38