Neighborhood Formation and Anomaly Detection in Bipartite Graphs

Jimeng Sun
Deepayan Chakrabarti
Huiming Qu
Christos Faloutsos
Speaker: Jimeng Sun
Bipartite Graphs
• G = {V1 ∪ V2, E}, where every edge in E joins a node in V1 to a node in V2
• Many applications can be modeled using bipartite graphs
• The key is to utilize these links across two natural groups for data mining
[Figure: bipartite graph with nodes a1, ..., ak in V1 and t1, ..., tn in V2, connected by edges E]
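A minimal sketch of how such a graph might be stored, assuming edges arrive as (a, t) pairs. The function name `build_bipartite` and the toy node labels are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def build_bipartite(edges):
    """Build adjacency maps for a bipartite graph from (a, t) pairs.

    Every edge joins a node a in V1 to a node t in V2, so two maps suffice:
    v1_to_v2[a] is the set of V2 neighbors of a, and v2_to_v1[t] the reverse.
    """
    v1_to_v2 = defaultdict(set)
    v2_to_v1 = defaultdict(set)
    for a, t in edges:
        v1_to_v2[a].add(t)
        v2_to_v1[t].add(a)
    return dict(v1_to_v2), dict(v2_to_v1)

# Hypothetical toy data: three nodes in V1 and three nodes in V2.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t2"), ("a3", "t3")]
v1_to_v2, v2_to_v1 = build_bipartite(edges)
```

Keeping one map per side makes both query directions (from V1 to V2 and back) a single lookup.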
Problem Definition
• Neighborhood formation (NF)
  • Given a query node a in V1, what are the relevance scores of all the nodes in V1 to a?
• Anomaly detection (AD)
  • Given a query node a in V1, what are the normality scores for nodes in V2 that link to a?
[Figure: bipartite graph with query node a; nodes in V1 are labeled with example relevance scores and nodes in V2 with example normality scores]
Application I: Publication network
• Authors vs. papers in research communities
• Interesting queries:
  • Which authors are most related to Dr. Carman?
  • Which is the most unusual paper written by Dr. Carman?
Application II: P2P network
• Users vs. files in P2P systems
• Interesting queries:
  • Find the users with similar preferences to me
  • Locate files that are downloaded by users with very different preferences
[Figure: bipartite graph of users and files]
Application III: Financial Trading
• Traders vs. stocks in stock markets
• Interesting queries:
  • Which stocks are most similar to company A?
  • Find the most unusual traders (i.e., those who trade across sectors)
Application IV: Collaborative filtering
• Collaborative filtering
• Recommendation systems
[Figure: bipartite graph of customers and products]
Outline
• Problem Definition
• Motivation
• Neighborhood formation
• Anomaly detection
• Experiments
• Related work
• Conclusion and future work
Neighborhood formation – intuition
Input: a graph G and a query node q
Output: relevance scores to q
• Run a random walk with restart from q in V1
• Record the probability of visiting each node in V1
• The nodes with higher probability are the neighbors
[Figure: bipartite graph with query node q; nodes are labeled with example visiting probabilities]
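The intuition above can be illustrated by simulating the walk and counting visits. This is only a sketch: the restart probability c = 0.15, the step count, the function name `simulate_rwr`, and the toy graph are assumptions, not values from the slides.

```python
import random

def simulate_rwr(neighbors, q, c=0.15, steps=100_000, seed=0):
    """Estimate visit probabilities of a random walk with restart from q.

    neighbors maps each node (in V1 or V2) to a list of adjacent nodes.
    At each step the walker jumps back to q with probability c, and
    otherwise moves to a uniformly chosen neighbor of the current node.
    """
    rng = random.Random(seed)
    visits = {node: 0 for node in neighbors}
    current = q
    for _ in range(steps):
        if rng.random() < c:
            current = q
        else:
            current = rng.choice(neighbors[current])
        visits[current] += 1
    return {node: count / steps for node, count in visits.items()}

# Hypothetical toy graph: a1 and a2 share t1; a2 and a3 share t2.
neighbors = {
    "a1": ["t1"], "a2": ["t1", "t2"], "a3": ["t2"],
    "t1": ["a1", "a2"], "t2": ["a2", "a3"],
}
scores = simulate_rwr(neighbors, "a1")
# Rank the other V1 nodes by how often the walk visited them.
ranking = sorted(["a2", "a3"], key=lambda n: -scores[n])
```

In this toy graph a2 is one hop closer to the query a1 than a3 is, so the walk visits it more often and it ranks first.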
Exact neighborhood formation
Input: a graph G and a query node q
Output: relevance scores to q
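• Construct the transition matrix P, where
  • every node in the graph becomes a state
  • every state has a restart probability c to jump back to the query node q
  • the remaining probability (1-c) goes to transitions along the edges
• Find the steady-state probability u, which gives the relevance scores of all the nodes to q
[Figure: random walk on the graph, with restart edges labeled c pointing back to q and a graph transition labeled (1-c)]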
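One common way to obtain the steady-state vector is power iteration on u = c·e_q + (1-c)·Pᵀu, where e_q is the indicator vector of q. The sketch below assumes uniform transitions over a node's edges, c = 0.15, and that every node has at least one neighbor; the slides do not fix these details, and `rwr_scores` is a hypothetical name.

```python
def rwr_scores(neighbors, q, c=0.15, tol=1e-12, max_iter=10_000):
    """Steady-state probabilities of a random walk with restart from q.

    neighbors maps each node to a non-empty list of adjacent nodes.
    Each pass applies u <- c * e_q + (1 - c) * P^T u, with P the
    row-normalized adjacency matrix, until u stops changing.
    """
    u = {node: 0.0 for node in neighbors}
    u[q] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in neighbors}
        nxt[q] = c  # restart mass always returns to the query node
        for node, prob in u.items():
            share = (1.0 - c) * prob / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        delta = sum(abs(nxt[n] - u[n]) for n in neighbors)
        u = nxt
        if delta < tol:
            break
    return u

# Hypothetical toy graph: a1 and a2 share t1; a2 and a3 share t2.
neighbors = {
    "a1": ["t1"], "a2": ["t1", "t2"], "a3": ["t2"],
    "t1": ["a1", "a2"], "t2": ["a2", "a3"],
}
u = rwr_scores(neighbors, "a1")
```

The restart term makes each pass a contraction with factor (1-c), so the iteration converges even though a bipartite graph on its own is periodic.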
Approximate neighborhood formation
• Scalability problem with exact neighborhood formation:
  • too expensive to do for every single node in V1
• Observation:
  • Nodes that are far away from q have almost 0 relevance scores.
• Idea:
  • Partition the graph and apply neighborhood formation only to the partition containing q.
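A sketch of the idea, assuming partition labels are already available from some graph partitioner (the dictionary `partition_of` is a hypothetical input, and c = 0.15 is an assumed value). Nodes outside the partition of q simply receive score 0, in line with the observation above.

```python
def rwr_scores(neighbors, q, c=0.15, tol=1e-12, max_iter=10_000):
    """Power iteration for random walk with restart from q."""
    u = {n: 0.0 for n in neighbors}
    u[q] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in neighbors}
        nxt[q] = c
        for node, prob in u.items():
            nbs = neighbors[node]
            if not nbs:
                # A node cut off by the partitioning sends the walker to q.
                nxt[q] += (1.0 - c) * prob
                continue
            share = (1.0 - c) * prob / len(nbs)
            for nb in nbs:
                nxt[nb] += share
        delta = sum(abs(nxt[n] - u[n]) for n in neighbors)
        u = nxt
        if delta < tol:
            break
    return u

def approximate_nf(neighbors, partition_of, q, c=0.15):
    """Run the walk only inside the partition that contains q."""
    part = partition_of[q]
    members = {n for n in neighbors if partition_of[n] == part}
    # Keep only edges whose endpoints both lie in the partition.
    sub = {n: [m for m in neighbors[n] if m in members] for n in members}
    local = rwr_scores(sub, q, c)
    return {n: local.get(n, 0.0) for n in neighbors}

# Hypothetical toy graph: two groups joined by the single edge a2-t3.
neighbors = {
    "a1": ["t1"], "a2": ["t1", "t2", "t3"],
    "t1": ["a1", "a2"], "t2": ["a2"],
    "a3": ["t3"], "a4": ["t3"], "t3": ["a2", "a3", "a4"],
}
partition_of = {"a1": 0, "a2": 0, "t1": 0, "t2": 0,
                "a3": 1, "a4": 1, "t3": 1}
approx = approximate_nf(neighbors, partition_of, "a1")
```

The cost of each query now depends on the size of one partition rather than on the whole graph, at the price of ignoring edges that cross partition boundaries.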
Anomaly detection - intuition
• t in V2 is normal if all a in V1 that link to t belong to the same neighborhood
[Figure: two example nodes t, one with high normality and one with low normality]
Anomaly detection - method
Input: a query node q from V2
Output: the normality score of q
• Find the set of nodes connected to q, say S
• Compute relevance scores of elements in S, denoted as rs
• Apply score function f(rs) to obtain normality scores:
  • e.g. f(rs) = mean(rs)
[Figure: query node q and the set S of nodes connected to it]
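The steps above can be sketched as follows, under one plausible reading: rs holds the pairwise relevance scores between members of S, each computed by a random walk with restart from one member. That reading, the value c = 0.15, the names `rwr_scores` and `normality`, and the toy graph are all assumptions.

```python
from statistics import mean

def rwr_scores(neighbors, q, c=0.15, tol=1e-12, max_iter=10_000):
    """Power iteration for random walk with restart from q."""
    u = {n: 0.0 for n in neighbors}
    u[q] = 1.0
    for _ in range(max_iter):
        nxt = {n: 0.0 for n in neighbors}
        nxt[q] = c
        for node, prob in u.items():
            share = (1.0 - c) * prob / len(neighbors[node])
            for nb in neighbors[node]:
                nxt[nb] += share
        delta = sum(abs(nxt[n] - u[n]) for n in neighbors)
        u = nxt
        if delta < tol:
            break
    return u

def normality(neighbors, t, c=0.15):
    """Mean pairwise relevance among the nodes linked to t."""
    S = neighbors[t]
    if len(S) < 2:
        raise ValueError("need at least two neighbors to compare")
    rs = []
    for a in S:
        scores = rwr_scores(neighbors, a, c)
        rs.extend(scores[b] for b in S if b != a)
    return mean(rs)  # f(rs) = mean(rs)

# Hypothetical toy graph: {a1, a2} and {a3, a4} form two groups;
# t5 is the only node in V2 that links across the groups.
neighbors = {
    "a1": ["t1", "t2"], "a2": ["t1", "t2", "t5"],
    "a3": ["t3", "t4", "t5"], "a4": ["t3", "t4"],
    "t1": ["a1", "a2"], "t2": ["a1", "a2"],
    "t3": ["a3", "a4"], "t4": ["a3", "a4"], "t5": ["a2", "a3"],
}
normal_score = normality(neighbors, "t1")
odd_score = normality(neighbors, "t5")
```

Here t1 links two nodes from the same group and gets the higher score, while t5 bridges the two groups and gets the lower one.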
Datasets
datasets               |V1|   |V2|   |E|    Avgdeg(V1)  Avgdeg(V2)
ConferenceAuthor (CA)  2687   288K   662K   510         5
AuthorPaper (AP)       316K   472K   1M     3           2
IMDB                   553K   204K   2.2M   4           11
Goals
[Q1]: Do the neighborhoods make sense? (NF)
[Q2]: How accurate is the approximate NF?
[Q3]: Do the anomalies make sense? (AD)
[Q4]: What about the computational cost?
[Q1] Exact NF
[Figure: two bar charts, for query nodes ICDM (CA) and Robert DeNiro (IMDB), plotting relevance score (y-axis) against the most relevant neighbors (x-axis)]
• The nodes (x-axis) with the highest relevance scores (y-axis) are indeed very relevant to the query node.
• The relevance scores can quantify how close/related the node is to the query node.
[Q2] Approximate NF
[Figure: two precision plots; precision vs. # of partitions (neighborhood size = 20) and precision vs. neighborhood size (num of partitions = 10)]
• Precision = fraction of overlap between ApprNF and NF among the top k neighbors
• The precision drops slowly as the number of partitions increases
• The precision remains high for a wide range of neighborhood sizes
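The precision measure defined above can be computed as in this small sketch. The helper names `top_k` and `precision_at_k`, the tie-breaking rule, and the example scores are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring nodes, breaking ties by node name."""
    ranked = sorted(scores, key=lambda n: (-scores[n], n))
    return ranked[:k]

def precision_at_k(exact_scores, approx_scores, k):
    """Fraction of the exact top-k neighbors also found by the approximation."""
    exact_top = set(top_k(exact_scores, k))
    approx_top = set(top_k(approx_scores, k))
    return len(exact_top & approx_top) / k

# Made-up relevance scores for four candidate neighbors.
exact = {"a1": 0.4, "a2": 0.3, "a3": 0.2, "a4": 0.1}
approx = {"a1": 0.5, "a2": 0.1, "a3": 0.3, "a4": 0.1}
p = precision_at_k(exact, approx, k=2)
```

With these made-up scores the exact top 2 is {a1, a2} and the approximate top 2 is {a1, a3}, so one of two neighbors overlaps.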
[Q3] Anomaly detection
[Figure: normality score plot]
• Randomly inject some nodes and edges (biased towards high-degree nodes)
• The genuine ones on average have higher normality scores than the injected ones
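One way the biased injection might be done is to sample the endpoints of an injected node with probability proportional to degree. The slides do not give the exact procedure, so the function `pick_endpoints`, the degree-proportional weighting, and the toy degrees below are assumptions.

```python
import random

def pick_endpoints(degrees, num_edges, rng):
    """Choose distinct V1 endpoints for one injected V2 node.

    degrees maps each V1 node to its degree; nodes are drawn with
    probability proportional to degree, so high-degree nodes are favored.
    """
    nodes = sorted(degrees)
    positive = [n for n in nodes if degrees[n] > 0]
    if num_edges > len(positive):
        raise ValueError("not enough nodes with positive degree")
    weights = [degrees[n] for n in nodes]
    chosen = set()
    while len(chosen) < num_edges:
        chosen.add(rng.choices(nodes, weights=weights, k=1)[0])
    return sorted(chosen)

# Hypothetical degrees of four nodes in V1.
degrees = {"a1": 5, "a2": 1, "a3": 1, "a4": 3}
rng = random.Random(0)
endpoints = pick_endpoints(degrees, num_edges=2, rng=rng)
```

Each injected node would then be added to V2 with edges to the returned endpoints before normality scores are computed.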
[Q4] Computational cost
[Figure: Approximate NF, time (sec) vs. # of partitions]
• Even with a small number of partitions, the computational cost can be reduced dramatically.
Related Work
• Random walk
[Brin & Page98] [Haveliwala WWW02]
• Graph partitioning
[Karypis and Kumar98] [Kannan et al. FOCS00]
• Collaborative filtering
[Shardanand&Maes95] …
• Anomaly detection
[Aggarwal&Yu SIGMOD01] [Noble&Cook KDD03]
[Newman03]
Conclusion
• Two important queries on bipartite graphs: NF and AD
• An efficient method for NF using random walk with restart and graph partitioning techniques
• Based on the results of NF, we can also spot anomalies (AD)
• Effectiveness is confirmed on real datasets
Future work and Q & A
• Future work
• What about time-evolving graphs?
• Contact:
Jimeng Sun
http://www.cs.cmu.edu/~jimeng