Some Preliminary Results of a Graph Mining Algorithm Design

Some Preliminary Results of a Graph Mining Algorithm Design:
Finding Frequent Directed Induced Sub-graphs
Sen Zhang, Mathematics, Computer Science and Statistics, SUNY Oneonta
Abstract
Graph structures are useful for representing complex non-linear relationships among entities in a
wide variety of information systems. As massive amounts of non-linearly structured data become
increasingly available to the public in applications such as traffic networks, social networks,
biological networks, and technological and information networks, graph mining has
become an active research area in the past decade.
Graph mining problems have many variants due to various restrictions. A literature survey and
some preliminary results of ongoing graph mining algorithm research are shared in this
presentation. General challenges of graph mining problems, possible restricted variants, and
possible optimization techniques are discussed.
[Figure: candidate adjacency matrices over labels such as g1, g2, g3, with undetermined entries marked "?". Annotations: C(10,3)=120; 2^6=64 vs 13; C(4,2)=6.]
[Figure: A partial A-priori lattice of pattern growing through joining. Each node lists a pattern edge set F and its supporting graph set S. The lattice starts from F=empty with G={d1,d2,d3,d4} and grows, one edge per level, up to the 4-edge patterns F={g1g2, g2g8, g8g5, g5g1} with S={d2} and F={g1g9, g8g7, g9g8, g7g6} with S={d1}.]
Variants of graph mining problems
Directed vs. undirected
Unique labels vs. repetitive labels
Connected vs. disconnected patterns
Strongly connected vs. weakly connected
Induced vs. embedded (with various nuances)
We start from induced sub-graphs of uniquely labeled
directed graphs; extensions to other variants and
penalization are being considered in the algorithm design.
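To make the starting variant concrete, here is a minimal Python sketch of an induced sub-graph test for directed graphs with unique labels. The representation (a set of labels plus a set of directed label pairs) and the function names are illustrative assumptions, not part of the original design. The second function shows a weaker, non-induced containment test for contrast only; it is one possible reading of "embedded", not a definition.

```python
def is_induced_subgraph(p_labels, p_edges, g_labels, g_edges):
    """True if the pattern equals the sub-graph of the graph induced by the pattern's labels.

    Assumes unique labels, so a node is identified by its label and no
    search over node mappings is needed.
    """
    p_labels, g_labels = set(p_labels), set(g_labels)
    if not p_labels <= g_labels:
        return False
    # Edges of the graph whose two endpoints both belong to the pattern.
    induced = {(u, v) for (u, v) in g_edges if u in p_labels and v in p_labels}
    return induced == set(p_edges)


def contains_all_edges(p_labels, p_edges, g_labels, g_edges):
    """Weaker, non-induced test: every pattern node and edge occurs in the graph."""
    return set(p_labels) <= set(g_labels) and set(p_edges) <= set(g_edges)


# Toy data (hypothetical): a triangle-like graph and a two-edge path pattern.
graph_labels = {"g1", "g2", "g3"}
graph_edges = {("g1", "g2"), ("g2", "g3"), ("g1", "g3")}
path_labels = {"g1", "g2", "g3"}
path_edges = {("g1", "g2"), ("g2", "g3")}

print(is_induced_subgraph(path_labels, path_edges, graph_labels, graph_edges))
print(contains_all_edges(path_labels, path_edges, graph_labels, graph_edges))
```

The path is contained edge by edge but is not induced, because the graph has the extra edge g1 -> g3 among the same three labels.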
Related Work
gSpan: Efficient canonical labeling to reduce redundancy
AGM and FSG: Adapting frequent itemset mining
algorithms to graph mining
SPIN and GASTON: Mining and extending simple
subgraphs (trees, paths)
MULE and RNGV: DFS-based; uniquely labeled directed
graphs; exact and inexact subgraph mining
GraphGrep: Mainly a graph indexing and searching
algorithm using suffix paths
Otminer: Restrictedly embedded ordered tree mining
algorithm
Frequent itemset and frequent sequence algorithms
A brute-force approach generates all candidate
subgraphs level by level through
adjacency matrix (AM) based rightmost expansion operations,
using two levels of enumeration: 1) all
possible node extensions and 2) all possible
edge extensions.
Without apriori-prune techniques, the
number of combinations explodes.
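The growth of the candidate space can be illustrated with a short Python sketch. It enumerates every directed edge set (no self-loops) over one triple of unique labels and then counts how many of them are weakly connected. The label names, the choice of 3 labels out of 10, and the helper names are illustrative assumptions.

```python
from itertools import product
from math import comb


def all_edge_sets(labels):
    """Yield every directed edge set (no self-loops) over the given labels."""
    slots = [(u, v) for u in labels for v in labels if u != v]
    for mask in product((0, 1), repeat=len(slots)):
        yield [e for e, bit in zip(slots, mask) if bit]


def is_weakly_connected(labels, edges):
    """Check connectivity of the underlying undirected graph."""
    labels = list(labels)
    adj = {u: set() for u in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, stack = {labels[0]}, [labels[0]]
    while stack:
        for w in adj[stack.pop()] - seen:
            seen.add(w)
            stack.append(w)
    return len(seen) == len(labels)


labels = ("g1", "g2", "g3")
candidates = list(all_edge_sets(labels))
connected = [e for e in candidates if is_weakly_connected(labels, e)]

print(len(candidates))                 # 64 edge sets for one label triple (2^6)
print(len(connected))                  # 54 of them are weakly connected
print(comb(10, 3) * len(candidates))   # 7680 candidates over all triples of 10 labels
```

Even at three nodes the unpruned candidate count is in the thousands for ten labels, which is why support-based pruning is applied before candidates are generated.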
[Figure: example adjacency matrices over labels g1, g2, g3, g5, g6, with undetermined entries marked "?".]
Challenges
The exponential nature of the number of subgraphs.
Subgraph Isomorphism
For counting frequencies, it is necessary to check
whether a given graph is a subgraph of another one.
In its general form, subgraph checking is NP-complete.
Canonical Labeling
To avoid redundancy while generating subgraphs,
canonical labeling of graphs is necessary.
At least as hard as graph isomorphism testing
Connectivity
Patterns of interest are generally connected (either
weakly or strongly), so it is often sufficient to
generate only connected subgraphs. However, disconnected
patterns might also be meaningful.
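For the uniquely labeled case, a canonical form is much easier to obtain than in general, because sorting the labels fixes the node order. The following Python sketch builds a string from the sorted labels and the rows of the adjacency matrix. The encoding and the function name `canonical_code` are assumptions made for illustration; with repeated labels the code would have to be minimized over node permutations, which is the hard case.

```python
def canonical_code(labels, edges):
    """Return a canonical string for a uniquely labeled directed graph.

    Two graphs receive the same code exactly when they have the same
    label set and the same directed edges.
    """
    order = sorted(labels)          # unique labels give one fixed node order
    edge_set = set(edges)
    rows = [
        "".join("1" if (u, v) in edge_set else "0" for v in order)
        for u in order
    ]
    return ",".join(order) + "|" + "/".join(rows)


# The same graph listed in two different orders yields one code.
code_a = canonical_code(["g2", "g1", "g3"], [("g2", "g1")])
code_b = canonical_code(["g3", "g1", "g2"], [("g2", "g1")])
print(code_a)            # g1,g2,g3|000/100/000
print(code_a == code_b)  # True
```

Candidate patterns can then be de-duplicated by storing their codes in a hash set.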
Several Issues Under Study
- New canonical form design and properties of the
canonical form
- Optimize existing or design new subgraph enumeration
algorithms
- Optimize existing or design new subgraph detection
algorithms
- Integrate these techniques into an efficient
algorithm to solve the graph mining problem at hand.
Aim of the project
The ultimate goal of the research is to design, implement, and
evaluate novel and relatively efficient algorithms to mine frequent sub-graphs
from graph databases under specially defined restrictions. This research is
ongoing.
Research Plan
1. Extend the author’s previous tree mining research to graph mining.
2. Conduct a literature review of existing graph mining algorithms.
3. Develop synthetic dataset generators.
4. Obtain a small number of real-world datasets.
5. Identify new graph mining problems.
6. Understand and optimize existing graph mining algorithms.
7. Design and implement new graph mining algorithms.
8. Investigate parallelizable components of graph mining algorithms.
Apriori-Prune and Other Filter Techniques
- If a graph is frequent, then all of its subgraphs must also be frequent. This property applies recursively, level by
level, as substructures are removed from the graph. In algorithm design, however, the property is used bottom-up,
in its contrapositive form: if a subgraph is infrequent, then every supergraph of it must be infrequent. The
contrapositive of a conditional statement is logically equivalent to the statement itself.
- If the intersection of the supporting graph sets of two subgraphs to be joined is smaller than the minimum support,
there is no need to join them.
- If labels are infrequent, there is no need to test the edges between them.
- Graph scattering, indexing, and gathering techniques, and graph aggregation statistics
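The support-list filter in the second bullet can be sketched in a few lines of Python. The function name `should_join`, the use of Python sets for supporting graph lists, and the toy graph ids are assumptions for illustration.

```python
def should_join(support_a, support_b, min_support):
    """Decide whether two patterns are worth joining.

    Returns (ok, shared), where `shared` is the intersection of the two
    supporting graph sets. A joined pattern can only occur in graphs that
    support both parts, so `shared` is an upper bound on its support.
    """
    shared = support_a & support_b
    return len(shared) >= min_support, shared


# Toy supporting graph sets (hypothetical ids).
ok, shared = should_join({"d1", "d4"}, {"d1"}, min_support=1)
print(ok, shared)        # True {'d1'}

ok2, shared2 = should_join({"d2"}, {"d1", "d4"}, min_support=1)
print(ok2, shared2)      # False set()
```

Because the intersection is only an upper bound in the induced setting, a candidate that passes this filter may still need its support verified.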
Algorithm 1: lE-TLE
Step 1: Find all frequent labels and their supporting graph lists.
Step 2: Find all possible label pairs, pruned by the intersection of their supporting
transaction lists. For each frequent label pair, find all possible edge combinations and verify the
support of each candidate.
Step 3: Grow patterns level by level, using two-stage enumeration (by nodes, then by edges),
until the largest frequent subgraphs are detected.
Potential: can be extended to unconnected patterns, which could be more meaningful in certain
applications. A canonical form can be designed on the adjacency matrix (AM).
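Steps 1 and 2 of Algorithm 1 can be sketched as follows, under stated assumptions: the database is a Python dict mapping a graph id to a pair (label set, directed edge set), labels are unique within a graph, and support is counted with induced semantics. Instead of generating all four edge configurations for a label pair and verifying each, this sketch groups the supporting graphs by the configuration they actually contain, which gives the same frequent pairs. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def frequent_labels(db, min_support):
    """Step 1: map each frequent label to the set of graph ids containing it."""
    support = defaultdict(set)
    for gid, (labels, _edges) in db.items():
        for label in labels:
            support[label].add(gid)
    return {l: s for l, s in support.items() if len(s) >= min_support}


def frequent_pair_patterns(db, min_support):
    """Step 2: frequent two-node induced patterns, keyed by (a, b, config).

    `config` is (a->b present, b->a present); (False, False) is the
    unconnected two-node pattern.
    """
    label_support = frequent_labels(db, min_support)
    result = {}
    for a, b in combinations(sorted(label_support), 2):
        shared = label_support[a] & label_support[b]
        if len(shared) < min_support:      # prune by intersecting support lists
            continue
        by_config = defaultdict(set)
        for gid in shared:
            edges = db[gid][1]
            by_config[((a, b) in edges, (b, a) in edges)].add(gid)
        for config, gids in by_config.items():
            if len(gids) >= min_support:
                result[(a, b, config)] = gids
    return result


# Toy database (hypothetical).
db = {
    "d1": ({"g1", "g2", "g3"}, {("g1", "g2")}),
    "d2": ({"g1", "g2"}, {("g1", "g2")}),
    "d3": ({"g1", "g2", "g4"}, {("g2", "g1")}),
}
print(frequent_pair_patterns(db, min_support=2))
```

With a minimum support of 2, only the pattern g1 -> g2 survives, supported by d1 and d2.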
Algorithm 2: E-BFT
Step 1: Find all frequent edges and their supporting graph lists.
Step 2: Find all edge pairs that share one end, pruned by the intersection of their supporting
transaction lists. For each hinged edge pair, verify the support and then keep or discard it.
Step 3: Grow patterns by joining two k-graphs, pruned by the intersection list, level by level,
until the largest frequent subgraphs are detected.
- Challenges: no efficient canonical form is available, and the approach is not easily extended to unconnected patterns.
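The edge-based, level-wise growth of Algorithm 2 can be sketched in Python under a simplifying assumption that is not part of the original design: a graph is taken to contain a pattern when it contains every edge of the pattern, so the support of a joined pattern is exactly the intersection of the supports of its parts. The induced-subgraph verification is left out. The database format (graph id mapped to a set of directed label pairs) and all names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def shares_endpoint(p, q):
    """True if some edge of p and some edge of q have a node in common."""
    return bool({n for e in p for n in e} & {n for e in q for n in e})


def mine_frequent_edge_sets(db, min_support):
    """Return a dict mapping each frequent connected edge set to its support set."""
    # Step 1: frequent single edges with their supporting graph lists.
    edge_support = defaultdict(set)
    for gid, edges in db.items():
        for e in edges:
            edge_support[e].add(gid)
    level = {
        frozenset([e]): gids
        for e, gids in edge_support.items()
        if len(gids) >= min_support
    }
    frequent = dict(level)
    # Steps 2 and 3: join two k-edge patterns into one (k+1)-edge pattern.
    while level:
        next_level = {}
        for (p, sp), (q, sq) in combinations(level.items(), 2):
            joined = p | q
            if len(joined) != len(p) + 1 or joined in next_level:
                continue
            if len(p) == 1 and not shares_endpoint(p, q):
                continue                     # single edges must be hinged
            shared = sp & sq
            if len(shared) >= min_support:   # prune by the intersection list
                next_level[joined] = shared
        frequent.update(next_level)
        level = next_level
    return frequent


# Toy database (hypothetical).
db = {
    "d1": {("g1", "g9"), ("g9", "g8"), ("g8", "g7")},
    "d2": {("g1", "g2"), ("g2", "g8")},
    "d4": {("g1", "g9"), ("g9", "g8")},
}
patterns = mine_frequent_edge_sets(db, min_support=2)
print(len(patterns))
```

For k of 2 or more, two patterns that join into a (k+1)-edge set already share an edge, so the result stays connected without a separate check; this is also why the scheme does not reach unconnected patterns.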
References
Huan, J., Wang, W., Prins, J., & Yang, J. (2004, August). Spin: mining maximal frequent subgraphs from
graph databases. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining (pp. 581-586). ACM.
Song, Y., & Chen, S. S. (2006, May). Item sets based graph mining algorithm and application in genetic
regulatory networks. In Granular Computing, 2006 IEEE International Conference on (pp. 337-340). IEEE.
Koyutürk, M., Grama, A., & Szpankowski, W. (2004). An efficient algorithm for detecting frequent subgraphs
in biological networks. Bioinformatics, 20(suppl 1), i200-i207.
Wikipedia
Baeza-Yates, R. (2004, January). A fast set intersection algorithm for sorted sequences. In Combinatorial
Pattern Matching (pp. 400-408). Springer Berlin Heidelberg.
Zhang, S., Du, Z., & Wang, J. (2014). New techniques for mining frequent patterns in unordered trees.
IEEE Transactions on Cybernetics.
Zhang, S., & Wang, J. T. L. (2008). Discovering frequent agreement subtrees from phylogenetic
data. Knowledge and Data Engineering, IEEE Transactions on, 20(1), 68-82.
Acknowledgements
Jason Wang, Professor of Computer Science and Bioinformatics, NJIT, for providing sample datasets and
collaborating on the research.
