Mining Graph Patterns Efficiently via Randomized Summaries

Chen Chen, Cindy X. Lin, Matt Fredrikson, Mihai Christodorescu, Xifeng
Yan, Jiawei Han
VLDB’09
Outline

Motivation
Preliminaries
SUMMARIZE-MINE FRAMEWORK
Bounding the False Negative Rate
Experiments
Conclusion
Motivation

Graph pattern mining is heavily needed in many real applications, such as bioinformatics, hyperlinked webs, and social network analysis.

Unfortunately, due to the fundamental role subgraph isomorphism plays in existing methods, they may all fall into a pitfall when the cost of enumerating a huge set of isomorphic embeddings blows up, especially in large graphs with few distinct labels.
Motivation

Consider possible ways to reduce the number of embeddings. In particular, since many embeddings in real applications overlap substantially, we explore the possibility of "merging" these embeddings to significantly reduce their overall cardinality.
Preliminaries
SUMMARIZE-MINE FRAMEWORK

1. Summarization: Given the raw database with frequency threshold min_sup, randomly bind vertices with identical labels into single nodes and collapse each graph correspondingly into a smaller summarized version. This step generalizes our view of the data to a higher level.
2. Mining: Apply any state-of-the-art frequent subgraph mining algorithm to the summarized database D' = {S1, S2, ..., Sn} with a slightly lowered support threshold min_sup', which generates the pattern set FP(D').
3. Verification: Check the patterns in FP(D') against the original database D, remove those p ∈ FP(D') whose support in D is less than min_sup, and collect the remaining patterns into R'.
4. Iteration: Repeat steps 1, 2 and 3 for t times, and combine the results from each iteration. Let R'1, R'2, ..., R't be the patterns obtained in different iterations; the final result is R' = R'1 ∪ R'2 ∪ ... ∪ R't. This step guarantees that the overall probability of missing any frequent pattern is bounded.

Verification deals with false positives; iteration bounds false negatives.

(Figure: raw DB and its summarized DB)
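The four steps above can be sketched as an outer loop. The helpers `summarize`, `mine`, and `support` are hypothetical stand-ins for the real components (randomized summarization, a plugged-in miner such as gSpan, and subgraph-isomorphism counting); the toy data at the bottom only exercises the loop:

```python
import random

def summarize_mine(D, min_sup, min_sup_lowered, t, summarize, mine, support):
    """Sketch of the SUMMARIZE-MINE outer loop. summarize, mine and support
    are caller-supplied stand-ins for randomized summarization, a frequent
    subgraph miner (e.g. gSpan) and subgraph-isomorphism counting."""
    result = set()
    for it in range(t):
        rng = random.Random(it)                  # fresh randomness per round
        Dp = [summarize(G, rng) for G in D]      # Step 1: summarize into D'
        FP = mine(Dp, min_sup_lowered)           # Step 2: mine D' at min_sup'
        result |= {p for p in FP                 # Step 3: verify on D, drop
                   if support(p, D) >= min_sup}  #   false positives
    return result                                # Step 4: R' = union of R'_i

# Toy stand-ins (graphs modeled as sets of items, just to exercise the loop):
D = [frozenset({"a", "b"}), frozenset({"a"}), frozenset({"a", "c"})]
found = summarize_mine(
    D, min_sup=2, min_sup_lowered=1, t=3,
    summarize=lambda G, rng: G,                          # no-op summarizer
    mine=lambda Dp, s: {x for G in Dp for x in G},       # "patterns" = items
    support=lambda p, db: sum(1 for G in db if p in G))
print(found)  # only "a" survives verification
```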
SUMMARIZE-MINE FRAMEWORK

Take gSpan as the skeleton of the mining algorithm.

Each labeled graph pattern can be transformed into a sequential representation called its DFS code.

With a lexicographic order defined on the DFS-code space, all subgraph patterns can be organized into a tree structure, where

1. patterns with k edges are put on the k-th level, and

2. a preorder traversal of this tree generates the DFS codes of all possible patterns in lexicographic order.
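The two properties above can be illustrated with a minimal sketch: DFS codes modeled as sequences of edge tuples (Python compares lists of tuples lexicographically, mirroring the order on DFS codes), plus a preorder walk over a toy pattern tree. The edge labels and the `toy_children` growth rule are made up for illustration:

```python
# A DFS code is a sequence of edge tuples (i, j, l_i, l_e, l_j), where i and j
# are vertex discovery positions and l_i, l_e, l_j are labels. Python compares
# lists of tuples lexicographically, which mirrors the order on DFS codes.
# (The chemical-looking labels are illustrative only.)
code_a = [(0, 1, "C", "s", "C"), (1, 2, "C", "s", "O")]
code_b = [(0, 1, "C", "s", "C"), (1, 2, "C", "d", "O")]
assert code_b < code_a  # codes share edge 1; "d" < "s" decides on edge 2

def preorder(code, children):
    """Preorder walk of the pattern tree: a k-edge code sits on level k, and
    visiting sorted children yields codes in lexicographic order per branch."""
    yield code
    for child in sorted(children(code)):
        yield from preorder(child, children)

def toy_children(code):
    """Grow codes up to 2 edges by appending one of two candidate edges."""
    if len(code) == 2:
        return []
    v = len(code)
    return [code + [(v, v + 1, "C", "s", "C")],
            code + [(v, v + 1, "C", "s", "O")]]

codes = list(preorder([], toy_children))
print(len(codes))  # 1 root + 2 one-edge + 4 two-edge patterns = 7
```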
SUMMARIZE-MINE FRAMEWORK

According to the DFS lexicographic order: (figure showing the enumeration order of example patterns)
SUMMARIZE-MINE FRAMEWORK
False Embeddings → False Positives

Reduce false positives. For a subpattern p1 of p2, anti-monotonicity gives sup(p1) ≥ sup(p2):

Technique 1 (Bottom-up): sup(p1) > sup(p2) > min_sup
Technique 2 (Top-down): min_sup > sup(p1) > sup(p2)

It is guaranteed that there are no false positives.
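Both pruning directions rest on the anti-monotonicity noted above. A minimal sketch of verification with bottom-up pruning, under toy assumptions (patterns as frozensets of edges, "subpattern" as subset inclusion, and `sup` as a containment count standing in for real subgraph-isomorphism testing):

```python
# Verification with anti-monotone pruning: patterns mined from the summarized
# database may be false positives, so each is re-checked against the raw
# database. Since sup(p1) >= sup(p2) whenever p1 is a subpattern of p2,
# once a pattern fails, every superpattern can be skipped without counting.

def verify(candidates, support, min_sup):
    verified, failed = [], []
    for p in sorted(candidates, key=len):    # small patterns first
        if any(f < p for f in failed):       # a subpattern already failed,
            continue                         #   so p cannot be frequent
        (verified if support(p) >= min_sup else failed).append(p)
    return verified

db = [{"ab", "bc"}, {"ab"}, {"ab", "bc", "cd"}]           # toy raw database
sup = lambda p: sum(1 for g in db if p <= g)              # containment count
cands = [frozenset({"ab"}), frozenset({"bc"}),
         frozenset({"ab", "bc"}), frozenset({"ab", "bc", "cd"})]
print(verify(cands, sup, min_sup=2))  # the 3-edge candidate is filtered out
```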
Bounding the False Negative Rate
Missed Embeddings → False Negatives
q(p): the probability that a fixed embedding f of pattern p survives summarization. The probability that all m_j vertices with label l_j are assigned to x_j different groups (and thus f continues to exist) is computed for each label; multiplying the probabilities for all L labels gives q(p).
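The per-label probability is a birthday-problem quantity. A hedged sketch under an assumed model — each of the m_j same-label vertices hashed uniformly and independently into one of w groups; the uniform-hashing model and the symbol w are my assumptions, not taken from the slide:

```python
from math import prod

def label_survival_prob(m, w):
    """P that m same-label vertices, each hashed uniformly and independently
    into one of w groups, all land in different groups (birthday problem):
    w/w * (w-1)/w * ... * (w-m+1)/w.  (Assumed model, see lead-in.)"""
    return prod((w - i) / w for i in range(m))

def q(per_label):
    """Multiply the per-label survival probabilities over all L labels;
    per_label is a list of (m_j, w_j) pairs."""
    return prod(label_survival_prob(m, w) for m, w in per_label)

print(q([(2, 10), (3, 5)]))  # 0.9 * 0.48 = 0.432
```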
Bounding the False Negative Rate
It is NOT guaranteed that there are no false negatives, but the probability is bounded.

The false negative rate after t iterations is (1 − P)^t. To make (1 − P)^t less than some small threshold ε:

Technique 1: For the raw database with frequency threshold min_sup, adopt a lower frequency threshold min_sup' for the summarized database.
Technique 2: Iterate the mining steps t times and combine the results generated each time.
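Given a per-iteration retention probability P, the smallest sufficient t solves (1 − P)^t ≤ ε, i.e. t ≥ log(ε) / log(1 − P). A minimal sketch (the example values of P and ε are illustrative):

```python
from math import ceil, log

def iterations_needed(P, eps):
    """Smallest t with (1 - P)**t <= eps, i.e. t >= log(eps) / log(1 - P)."""
    return ceil(log(eps) / log(1 - P))

print(iterations_needed(0.5, 0.01))  # 7 rounds: 0.5**7 ≈ 0.0078 <= 0.01
```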
Experiments
Conclusion

Isomorphism tests on small graphs are much easier.

Each graph goes through t iterations to reduce the false negative rate; how should t be chosen?