GLS-SOD: A Generalized Local Statistical
Approach for Spatial Outlier Detection
Feng Chen, Chang-Tien Lu, Arnold P. Boedihardjo
Virginia Tech, Computer Science Department
July 28, 2010
KDD 2010, Washington, DC, USA
Outline
• Motivations
• Generalized Local Statistical (GLS) Model
• GLS Robust Estimation & Inferences
• Simulations
• Summary
Applications
What’s Special for Spatial Data?
• By the first law of geography, “Everything is related to everything else, but nearby things are more related than distant things” [Tobler 79]
• Spatial autocorrelation
  • Correlation of a variable with itself through space
Global based Spatial Outlier Detection
Local based Spatial Outlier Detection
Local (Laplacian) smoothing
swamping
masking
Why GLS-SOD?
• Global based
  • Pros: high accuracy with statistical justifications
  • Cons: very slow; complicated estimation process; nonconvex optimization
• Local based
  • Pros: very fast; simple
  • Cons: heuristic-driven; lacks statistical justification
• Questions
  • Local (Laplacian) smoothing vs. spatial dependence?
  • Statistical connections between local and global methods?
  • When will existing local based methods perform poorly, and how to handle these situations?
Gaussian Random Field
Large scale trend
White noise variation
Small Scale Variation
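The three labels above name the components of a spatial Gaussian random field. A minimal Python sketch of that decomposition follows. The linear trend, the exponential covariance, and every parameter value (beta, sill, corr_range, nugget) are illustrative assumptions, not settings taken from the talk.

```python
import numpy as np

def simulate_field(coords, beta=(1.0, 0.5, -0.3), sill=1.0,
                   corr_range=3.0, nugget=0.1, seed=0):
    """Draw one field as trend + small-scale variation + white noise.

    All defaults are hypothetical values chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    n = len(coords)
    # Large-scale trend: linear in the coordinates (assumed form).
    X = np.column_stack([np.ones(n), coords])
    trend = X @ np.asarray(beta)
    # Small-scale variation: zero-mean Gaussian with exponential covariance.
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    cov = sill * np.exp(-h / corr_range)
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))  # jitter for stability
    small_scale = chol @ rng.standard_normal(n)
    # White-noise variation: independent Gaussian errors.
    noise = rng.normal(0.0, np.sqrt(nugget), size=n)
    return trend, small_scale, noise

# Toy 10 x 10 grid of locations.
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
trend, small_scale, noise = simulate_field(coords)
z = trend + small_scale + noise  # observed attribute values
```

Keeping the three components separate makes it easy to see which part of an observed value a detection method is expected to explain and which part it should treat as noise.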
Generalized Local Statistical Model
Generalized local statistical model (GLS)
≈ (See theorem 3)
Convolution effects
Generalized Least Squares
Forward / Backward Search
• GLS Backward Search
  • Model estimation by generalized least squares
  • Remove the most probable outlier and update all local differences
  • Repeat until the p-values of all existing objects are greater than a threshold (e.g., 0.025)
• GLS Forward Search
  • Estimation by a robust subset S of local differences
  • Add test objects one by one to the training set S
  • Check the change of the smallest p-value in S
  • A large drop in the smallest p-value indicates an outlier
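The two searches above can be sketched in Python under strong simplifying assumptions. A local difference is taken to be a value minus the mean of its k nearest neighbors, the differences are treated as i.i.d. Gaussian, and a median/MAD fit (backward) or a sample mean and standard deviation (forward) stands in for the generalized least squares estimation named on the slide. The function names, the choice of k, and the toy data are hypothetical.

```python
import numpy as np
from math import erfc, sqrt

def knn_local_diff(coords, z, k):
    """Value at each point minus the mean of its k nearest neighbors."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)          # a point is not its own neighbor
    nn = np.argsort(d2, axis=1)[:, :k]
    return z - z[nn].mean(axis=1)

def two_sided_p(t):
    """Two-sided p-value of a standard normal score t."""
    return erfc(abs(t) / sqrt(2.0))

def backward_search(coords, z, k=8, alpha=0.025):
    """Drop the most extreme object, recompute local differences, repeat."""
    remaining = np.arange(len(z))
    removed = []
    while len(remaining) > k + 2:
        diff = knn_local_diff(coords[remaining], z[remaining], k)
        center = np.median(diff)                           # stand-in for GLS fit
        scale = 1.4826 * np.median(np.abs(diff - center))  # MAD-based sigma
        scores = (diff - center) / scale
        worst = int(np.argmax(np.abs(scores)))             # smallest p-value
        if two_sided_p(scores[worst]) >= alpha:
            break                                          # all p-values pass
        removed.append(int(remaining[worst]))
        remaining = np.delete(remaining, worst)
    return removed

def forward_search(diff, start_fraction=0.5):
    """Grow a subset S from the most central differences outward and
    record the smallest p-value in S after each addition."""
    order = np.argsort(np.abs(diff - np.median(diff)))     # most central first
    m0 = max(3, int(start_fraction * len(diff)))
    smallest_p = []
    for m in range(m0, len(diff) + 1):
        subset = diff[order[:m]]
        scores = (subset - subset.mean()) / subset.std(ddof=1)
        smallest_p.append(two_sided_p(np.abs(scores).max()))
    return order[m0 - 1:], np.array(smallest_p)

# Toy data: constant-mean noise on a 10 x 10 grid, one contaminated object.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
z = rng.normal(0.0, 0.1, size=100)
z[17] += 10.0

removed = backward_search(coords, z)
added, smallest_p = forward_search(knn_local_diff(coords, z, 8))
```

On this toy grid the contaminated object (index 17) is removed first by the backward search and enters the forward search last, where the smallest p-value drops sharply. The threshold is applied without any multiple-testing correction, so the backward search can also remove a few ordinary objects afterwards.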
swamping
masking
GLS Z-Test Statistics
Connections with Existing Methods
• If $F \Sigma F^T = \sigma^2 I$, then GLS-SOD is equivalent to Universal Kriging SOD.
  • Local vs. global estimator for a spatial Gaussian random field. The key is which estimator is more robust.
• When $F \Sigma F^T = \sigma^2 I$, $F X \beta = \mu \mathbf{1}$, and $F F^T = \sigma_0^2 I$, then local based SOD is equivalent to GLS-SOD and Universal Kriging SOD.
  • Local based SOD is a special case of GLS-SOD
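One standard fact behind equivalences of this kind is that a generalized least squares fit reduces to ordinary least squares when the error covariance is a multiple of the identity. The Python sketch below checks this numerically. It illustrates that general fact only, not the slide's theorems, and the data and the helper name gls_fit are hypothetical.

```python
import numpy as np

def gls_fit(X, y, cov):
    """GLS estimate: (X^T C^-1 X)^-1 X^T C^-1 y, computed via linear solves."""
    cinv_X = np.linalg.solve(cov, X)
    cinv_y = np.linalg.solve(cov, y)
    return np.linalg.solve(X.T @ cinv_X, X.T @ cinv_y)

# Hypothetical regression data.
rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.uniform(0.0, 10.0, size=n)])
y = X @ np.array([2.0, -1.0]) + rng.normal(0.0, 0.5, size=n)

beta_gls = gls_fit(X, y, 2.5 * np.eye(n))          # covariance = sigma^2 * I
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]    # ordinary least squares
```

With any covariance of the form sigma^2 I the two estimates agree up to rounding. With a covariance that has off-diagonal structure they generally differ.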
Experiments
• Standard Simulation Model for SOD
  • 864 different simulation settings. Six repetitions for each setting; consider average error
  • Existing statistical SOD methods only consider 10 to 15 simulation settings in their experiments.
Simulation Results
Summary
• Design of a generalized local statistical framework
• Robust estimation and outlier detection methods based on the proposed GLS framework
• In-depth study on the connection between different SOD methods
• Comprehensive simulations to validate the effectiveness and efficiency of GLS
Thank you!
[email protected]
Property of “$F \Sigma F^T$” (Theorem 2)
Property of “$\sigma_0^2 F F^T$”
Conclusion: When selecting a relatively large neighborhood size to do local smoothing, we can approximate “$F F^T$” as an identity matrix.
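The conclusion above can be probed numerically. The Python sketch below assumes that F is the identity minus a k-nearest-neighbor averaging matrix (each row of the averaging matrix puts weight 1/k on the k nearest neighbors). This construction, the function names, and the random locations are assumptions. The sketch measures closeness entry by entry, which is only one way to read the approximation.

```python
import numpy as np

def neighborhood_difference_matrix(coords, k):
    """F = I - W, where row i of W averages the k nearest neighbors of i."""
    n = len(coords)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d2, np.inf)              # exclude the point itself
    nn = np.argsort(d2, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.arange(n)[:, None], nn] = 1.0 / k
    return np.eye(n) - W

def max_entry_deviation(F):
    """Largest absolute entry of F F^T - I."""
    return np.abs(F @ F.T - np.eye(len(F))).max()

# Hypothetical locations: 200 random points in the unit square.
rng = np.random.default_rng(2)
coords = rng.uniform(0.0, 1.0, size=(200, 2))

dev_small = max_entry_deviation(neighborhood_difference_matrix(coords, 4))
dev_large = max_entry_deviation(neighborhood_difference_matrix(coords, 32))
```

Under this construction every entry of F F^T - I is at most 2/k in absolute value, so the entrywise deviation shrinks as the neighborhood grows. The rows of F sum to zero, so F F^T stays singular and is not close to the identity in operator norm.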
Connections with Existing Methods
Simulations
• Simulation Model and Settings
  • 864 different simulation settings. Six repetitions for each setting; consider average error
Exponential Model
Contaminations
Spherical Model
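The exponential and spherical models named above are standard spatial covariance families. A minimal Python sketch of their textbook forms is given below. The parameterization (sill, corr_range) is an assumption; the slide does not state which convention or which parameter values the simulations used.

```python
import numpy as np

def exponential_cov(h, sill=1.0, corr_range=1.0):
    """Exponential model: sill * exp(-h / range)."""
    h = np.asarray(h, dtype=float)
    return sill * np.exp(-h / corr_range)

def spherical_cov(h, sill=1.0, corr_range=1.0):
    """Spherical model: sill * (1 - 1.5 r + 0.5 r^3) for r = h / range <= 1,
    and exactly zero beyond the range."""
    h = np.asarray(h, dtype=float)
    r = np.clip(h / corr_range, 0.0, 1.0)
    return sill * (1.0 - 1.5 * r + 0.5 * r ** 3)

# Covariance at a few distances, for comparison.
distances = np.linspace(0.0, 2.0, 5)
exp_values = exponential_cov(distances)
sph_values = spherical_cov(distances)
```

The practical difference is the tail: the spherical model gives zero correlation beyond its range, while the exponential model decays smoothly and never reaches zero.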
Simulation Results