Outlier detection

國立雲林科技大學
National Yunlin University of Science and Technology
On Community Outliers and their Efficient
Detection in Information Networks
Jing Gao, Feng Liang, Wei Fan, Chi Wang, Yizhou Sun, and Jiawei Han
SIGKDD, 2010
Presented by Hung-Yi Cai
2011/2/10
1
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outlines







2
Motivation
Objectives
Related Work
Methodology
Experiments
Conclusions
Comments
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Motivation
3

Linked or networked data are ubiquitous in many
applications.

Outlier detection in information networks can reveal
important anomalous and interesting behaviors that
are not obvious if community information is ignored.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Objectives


4
To propose an efficient solution by modeling networked data
as a mixture model composed of multiple normal communities
and a set of randomly generated outliers.
The probabilistic model characterizes both data and links
simultaneously by defining their joint distribution based on
hidden Markov random fields (HMRF).
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Related Work
5
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Methodology
6
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Methodology

Community Outlier



Fitting Community Outlier Detection Model


7
Outlier Detection via HMRF
Modeling Continuous and Text Data
Inference
Parameter Estimation
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Community Outlier

8
Outlier Detection via HMRF:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Community Outlier

9
Modeling Continuous and Text Data:
─
Continuous data
─
Text data
Intelligent Database Systems Lab
Fitting Community Outlier Detection Model

Inference:
─
─
10
We first assume that the model parameters in Θ are
known, and discuss how to obtain an assignment of the
hidden variables.
The objective is to find the configuration that
maximizes the posterior distribution given Θ.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model
11
N.Y.U.S.T.
I. M.

In general, we seek a labeling of the objects, Z = {z1,...,zM}, to
maximize the posterior probability (MAP):

Using the Iterated Conditional Modes (ICM) algorithm to
solve this MAP estimation problem. It adopts a greedy strategy
by calculating local minimization iteratively and the
convergence is guaranteed after a few iterations.
Intelligent Database Systems Lab
Fitting Community Outlier Detection Model

12
Normal community…
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model

13
Outlier…
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model
14
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model

Parameter Estimation:
─
─
15
In this part, we consider the problem of estimating
unknown Θ from the data.
We view it as an “incomplete-data” problem, and use
the expectation-maximization (EM) algorithm to solve
it.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model

16
The outlier component…
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model

17
The normal component…
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Fitting Community Outlier Detection Model
18
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Experiments

19
To conduct experiments on synthetic data to
compare detection accuracy with the baseline
methods, and evaluate on real datasets to
validate that the proposed algorithm can detect
community outliers effectively.
Intelligent Database Systems Lab
Experiments - Synthetic Data

Baseline Method:
GLODA
DNODA
CNA
20
This baseline looks at the data values only. We use the
popular outlier detection algorithm LOF to detect “global”
outliers without taking the network structure into account.
This method only considers the values of each object’s
direct neighbors in the graph.
In this approach, we partition the graph into K
communities using clustering algorithms, and define
outliers as the objects that have significantly different
values compared with the other objects in the same
community.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments - Synthetic Data

21
Empirical Results:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments - Synthetic Data

22
Sensitivity and Time Complexity:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
N.Y.U.S.T.
I. M.
Experiments - DBLP

23
Sub-network of Conferences:
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments - DBLP

Sub-network of Conferences:
─
24
The community outliers detected by the proposed
algorithm include CVPR and CIKM.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Experiments - DBLP

Sub-network of Authors:
─
25
Community Outliers in DBLP co-authors.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Conclusions
26

In this paper, discussing a new outlier detection
problem in networks containing rich information,
including data about each object and relationships
among objects.

Proposing a generative model called CODA that
unifies both community discovery and outlier
detection in a probabilistic formulation based on
hidden Markov random fields.
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Comments

Advantages
─

Applications
─
─
27
The CODA algorithm outperform the baseline methods
on synthetic and DBLP data.
Outlier Detection
Community Discovery
Intelligent Database Systems Lab