Scalable Graph Clustering using Stochastic Flows

Applications to Community Discovery
Venu Satuluri and Srinivasan Parthasarathy
Data Mining Research Laboratory
Dept. of Computer Science and Engineering
The Ohio State University
http://www.cse.ohio-state.edu/dmrl
KDD 2009
Outline
• Introduction
- Problem Statement
- Markov Clustering (MCL)
• Proposed Algorithms
- Regularized MCL (R-MCL)
- Multi-level Regularized MCL (MLR-MCL)
• Evaluation
• Conclusions
Problem Statement
Graph Clustering:
Partition the vertices of a graph into disjoint sets such
that each partition is a well-connected/coherent
group.
Applications:
• Discovery of protein complexes [Snel ‘02]
• Community discovery in social networks [Newman ‘06]
• Image segmentation [Shi ‘00]
Existing solutions:
• Spectral methods [Shi ‘00]
• Edge-based agglomerative/divisive methods [Newman ‘04]
• Kernel K-Means [Dhillon ‘07]
• Metis [Karypis ’98]
• Markov Clustering (MCL) [van Dongen ’00]
Markov Clustering (MCL) [van Dongen ‘00]
The original algorithm for clustering graphs using stochastic
flows.
Advantages:
• Simple and elegant.
• Widely used in Bioinformatics because of its noise tolerance
and effectiveness.
Disadvantages:
• Very slow.
- Takes 1.2 hours to cluster a 76K node social network.
• Prone to output too many clusters.
- Produces 1416 clusters on a 4741 node PPI network.
Can we redress the disadvantages of MCL while retaining
its advantages?
Terminology
• Flow: Transition probability from a node to another node.
• Flow matrix: Matrix with the flows among all nodes; ith
column represents flows out of ith node. Each column sums
to 1.
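Example (figure): path graph 1 - 2 - 3. Node 2 sends flow 0.5 to each of
nodes 1 and 3; nodes 1 and 3 each send flow 1 to node 2.
Flow matrix (column j holds the flows out of node j):
        1      2      3
1       0      0.5    0
2       1.0    0      1.0
3       0      0.5    0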
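The flow matrix of the three-node example can be computed by column-normalizing an adjacency matrix. The sketch below is illustrative only: the function name `flow_matrix` is hypothetical, nodes are 0-indexed, and every node is assumed to have at least one edge.

```python
def flow_matrix(adj):
    """Column-normalize an adjacency matrix so column j holds the flows out of node j.

    Assumes every node has at least one edge (no zero column sums).
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    return [[adj[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


# Path graph 1 - 2 - 3 from the slide, stored 0-indexed.
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
M = flow_matrix(adj)
for row in M:
    print(row)
# [0.0, 0.5, 0.0]
# [1.0, 0.0, 1.0]
# [0.0, 0.5, 0.0]
```

Each column sums to 1, matching the definition of a flow matrix.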
The MCL algorithm
Input: A, adjacency matrix
Initialize M to M_G, the canonical transition matrix:
M := M_G := (A+I) D^-1
Repeat until converged:
• Expand: M := M*M
- Enhances flow to well-connected nodes as well as to new nodes.
• Inflate: M := M.^r (r usually 2), renormalize columns
- Increases inequality in each column.
- “Rich get richer, poor get poorer.”
• Prune
- Saves memory by removing entries close to zero.
Output clusters
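The MCL loop can be sketched in plain Python as follows. This is a minimal illustration, not the reference implementation: the fixed pruning threshold, the convergence test (largest entry change below `tol`), the rule for reading clusters off the final matrix (group each node by the row that receives most of its flow), and all function names are assumptions made for the example.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (all-zero columns stay zero)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def canonical_flow(adj):
    """M_G = (A + I) D^-1: add self-loops, then column-normalize."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    return normalize_columns(with_loops)


def expand(m):
    """Expand: M := M * M."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, r):
    """Inflate: raise every entry to the power r, then renormalize columns."""
    return normalize_columns([[x ** r for x in row] for row in m])


def prune(m, threshold):
    """Prune: zero out entries below the threshold, then renormalize columns."""
    return normalize_columns([[x if x >= threshold else 0.0 for x in row]
                              for row in m])


def mcl(adj, r=2.0, threshold=1e-4, max_iter=100, tol=1e-9):
    m = canonical_flow(adj)
    for _ in range(max_iter):
        new = prune(inflate(expand(m), r), threshold)
        change = max(abs(a - b) for ra, rb in zip(new, m) for a, b in zip(ra, rb))
        m = new
        if change < tol:  # assumed convergence test
            break
    return m


def clusters_from_flow(m):
    """Group nodes by the row (attractor) that receives most of their flow."""
    n = len(m)
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())


# Hypothetical test graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
flow = mcl(adj)
clusters = clusters_from_flow(flow)
print(clusters)
```

Dense lists keep the sketch short; a practical implementation would use sparse matrices, since pruning exists precisely to keep M sparse.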
The Regularize operator
Why does MCL output many clusters?
The original matrix is used only at the start, so neighborhood
information fades as the iterations proceed.
This is called “overfitting”: MCL does not penalize divergence of flows
between neighbors.
Remedy: Let q_i, i=1:k, be the flow distributions of the k neighbors
of node q in the graph, and let w_i, i=1:k, be the respective normalized
edge weights. Update the flow of q to the distribution that minimizes
the weighted divergence from its neighbors' flows:
$q^* = \arg\min_q \sum_{i=1}^{k} w_i \, \mathrm{KL}(q_i \,\|\, q)$
Closed-form solution:
$q^*(x) = \sum_{i=1}^{k} w_i \, q_i(x)$
This update defines the Regularize operator. In matrix notation,
Regularize(M) := M*M_G
= M*(A+I)D^-1
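A small numeric check can make the matrix form concrete: column q of M*M_G equals the weighted average of the columns of M, with weights taken from column q of M_G. In the sketch below the flow matrix `M` is made up for illustration, `M_G` is the canonical matrix of the path graph 1 - 2 - 3 with self-loops, and the function names are hypothetical. Because of the self-loops in (A+I), node q counts itself among its neighbors.

```python
def regularize(m, m_g):
    """Regularize(M) := M * M_G (plain dense matrix product)."""
    n = len(m)
    return [[sum(m[x][i] * m_g[i][q] for i in range(n)) for q in range(n)]
            for x in range(n)]


def weighted_average(m, m_g, q):
    """Sum over neighbors i of w_i * (column i of M), with w_i = M_G[i][q]."""
    n = len(m)
    return [sum(m_g[i][q] * m[x][i] for i in range(n)) for x in range(n)]


# Canonical transition matrix (A + I) D^-1 of the path graph 1 - 2 - 3.
M_G = [
    [0.5, 1 / 3, 0.0],
    [0.5, 1 / 3, 0.5],
    [0.0, 1 / 3, 0.5],
]
# A made-up current flow matrix; each column sums to 1.
M = [
    [0.6, 0.2, 0.1],
    [0.3, 0.6, 0.3],
    [0.1, 0.2, 0.6],
]

q = 1  # middle node of the path, 0-indexed
regularized = regularize(M, M_G)
column_q = [row[q] for row in regularized]
average_q = weighted_average(M, M_G, q)
print(column_q)
print(average_q)
```

The two printed vectors agree, which is the sense in which the matrix product applies the closed-form update to every node at once.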
The Regularized-MCL algorithm
Input: A, adjacency matrix
Initialize M to M_G, the canonical transition matrix:
M := M_G := (A+I) D^-1
Repeat until converged:
• Regularize: M := M*M_G
- Takes into account flows of the neighbors.
• Inflate: M := M.^r (r usually 2), renormalize columns
- Increases inequality in each column.
- “Rich get richer, poor get poorer.”
• Prune
- Saves memory by removing entries close to zero.
Output clusters
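R-MCL differs from MCL only in replacing Expand (M*M) with Regularize (M*M_G). A minimal plain-Python sketch follows; as assumptions for the example, it uses a fixed pruning threshold, stops when no entry changes by more than `tol`, and assigns each node to the row that receives most of its flow. All function names are hypothetical.

```python
def normalize_columns(m):
    """Rescale each column to sum to 1 (all-zero columns stay zero)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def r_mcl(adj, r=2.0, threshold=1e-4, max_iter=100, tol=1e-9):
    n = len(adj)
    # M_G = (A + I) D^-1 is kept fixed and reused in every iteration.
    m_g = normalize_columns(
        [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    m = m_g
    for _ in range(max_iter):
        new = matmul(m, m_g)                                      # Regularize
        new = normalize_columns([[x ** r for x in row] for row in new])  # Inflate
        new = normalize_columns([[x if x >= threshold else 0.0 for x in row]
                                 for row in new])                 # Prune
        change = max(abs(a - b) for ra, rb in zip(new, m) for a, b in zip(ra, rb))
        m = new
        if change < tol:  # assumed convergence test
            break
    return m


def clusters_from_flow(m):
    """Group nodes by the row (attractor) that receives most of their flow."""
    n = len(m)
    groups = {}
    for j in range(n):
        attractor = max(range(n), key=lambda i: m[i][j])
        groups.setdefault(attractor, []).append(j)
    return sorted(groups.values())


# Hypothetical test graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
flow = r_mcl(adj)
clusters = clusters_from_flow(flow)
print(clusters)
```

Because `m_g` never changes, the product in each iteration multiplies by a matrix that stays as sparse as the input graph.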
Multi-level Regularized MCL
• Coarsen repeatedly: Input Graph → Intermediate Graph → ... → Coarsest Graph.
- Faster to run on smaller graphs first.
• On the coarsest graph and on each intermediate graph: run Curtailed R-MCL,
then project flow onto the next finer graph.
- Running on the coarse graphs captures the global topology of the graph.
- The projected flow initializes the flow matrix of the refined graph.
• On the input graph: run R-MCL to convergence, output clusters.
Coarsening operation
• Construct a matching: a set of edges such that no two
edges share a vertex.
• Each matched edge is mapped to a super-node in the
coarsened graph; the super-node's edges are the union of
the edges of its two endpoints.
• Two maps keep track of which original nodes each
super-node contains.
Example (figure): a 6-node graph with matching {1-4, 2-3, 5-6}; each
matched pair is mapped to one super-node (A, B, C).
Super-node:  A  B  C
Map1:        1  2  5
Map2:        4  3  6
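One coarsening step can be sketched as below. The slides do not say how the matching is chosen, so this example assumes a simple greedy pass over the edge list; the edge list itself is hypothetical, nodes are 0-indexed, and unmatched nodes are assumed to become super-nodes on their own (both maps point to the same node).

```python
def coarsen(edges, n):
    """One coarsening step on an undirected graph with nodes 0..n-1.

    Returns (map1, map2, coarse_edges): super-node s contains the original
    nodes map1[s] and map2[s] (equal when the node was left unmatched).
    """
    matched = [False] * n
    map1, map2 = [], []
    # Greedy matching: take an edge if neither endpoint is already matched.
    for u, v in edges:
        if u != v and not matched[u] and not matched[v]:
            matched[u] = matched[v] = True
            map1.append(u)
            map2.append(v)
    # Unmatched nodes become singleton super-nodes.
    for u in range(n):
        if not matched[u]:
            map1.append(u)
            map2.append(u)
    # Original node -> index of its super-node.
    super_of = {}
    for s, (a, b) in enumerate(zip(map1, map2)):
        super_of[a] = s
        super_of[b] = s
    # Coarse edges are the union of the original edges, minus internal ones.
    coarse_edges = set()
    for u, v in edges:
        su, sv = super_of[u], super_of[v]
        if su != sv:
            coarse_edges.add((min(su, sv), max(su, sv)))
    return map1, map2, sorted(coarse_edges)


# Hypothetical 6-node graph; the first three edges form the matching.
edges = [(0, 3), (1, 2), (4, 5), (0, 1), (2, 4), (3, 5)]
map1, map2, coarse_edges = coarsen(edges, 6)
print(map1)          # [0, 1, 4]
print(map2)          # [3, 2, 5]
print(coarse_edges)  # [(0, 1), (0, 2), (1, 2)]
```

Shifted to 1-based labels, the maps read 1, 2, 5 and 4, 3, 6, the same pairing as in the slide's example.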
Project flow
Evaluation criteria
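• The normalized cut of a cluster C in the graph
G is defined as:
$\mathrm{Ncut}(C) = \frac{\sum_{v_i \in C,\, v_j \notin C} A(i,j)}{\sum_{v_i \in C} \mathrm{degree}(v_i)}$
• Average Ncut over the k clusters C_1, ..., C_k:
$\mathrm{Avg.\ Ncut} = \frac{1}{k} \sum_{i=1}^{k} \mathrm{Ncut}(C_i)$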
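As a worked example, the sketch below computes the normalized cut of each cluster and their average, assuming the usual definition: the weight of edges leaving the cluster divided by the total degree of the cluster's nodes. The graph, the clustering, and the function names are made up for illustration.

```python
def ncut(adj, cluster):
    """Weight of edges leaving the cluster divided by the cluster's total degree."""
    n = len(adj)
    members = set(cluster)
    cut = sum(adj[u][v] for u in members for v in range(n) if v not in members)
    volume = sum(adj[u][v] for u in members for v in range(n))
    return cut / volume if volume else 0.0


def average_ncut(adj, clusters):
    """Mean normalized cut over the k clusters."""
    return sum(ncut(adj, c) for c in clusters) / len(clusters)


# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
clusters = [[0, 1, 2], [3, 4, 5]]
avg = average_ncut(adj, clusters)
print(avg)  # 1/7: each triangle cuts one edge and has total degree 7
```

Lower values are better: a cluster scores low when few of its edges cross to the rest of the graph.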
Comparison with MCL
• Why is R-MCL much faster than MCL?
– Regularize is faster than Expand because M_G
is sparser than M, and R-MCL can stop earlier.
• MLR-MCL appears to improve performance only
slightly, especially in average Ncut.
Comparison with Graclus and Metis
Quality: MLR-MCL improves upon both Graclus and Metis
Comparison with Graclus and Metis
Speed: MLR-MCL is faster than Graclus and competitive with Metis
Evaluation on PPI networks
Yeast PPI network with 4741 proteins and 15148 interactions.
Annotations from the Gene Ontology database used as ground truth.
MLR-MCL returns clusters of
higher biological significance
than MCL or Graclus.
Conclusions
• Regularized MCL overcomes the fragmentation problem of
MCL.
• Multi-level Regularized MCL further improves quality and
speed of R-MCL.
• MLR-MCL often outperforms state-of-the-art algorithms in both
quality and speed on a wide variety of real datasets.
Future Directions:
• Novel coarsening strategies
• Extensions to directed and bi-partite graphs.
Acknowledgements:
This work is supported in part by the following grants: NSF CAREER
IIS-0347662, RI-CNS-0403342, CCF-0702586 and IIS-0742999
References:
1. MCL - Graph Clustering by Flow Simulation. S. van Dongen, Ph.D. thesis,
University of Utrecht, 2000.
2. Graclus - Weighted Graph Cuts without Eigenvectors: A Multilevel Approach.
Dhillon et al., IEEE Trans. PAMI, 2007.
3. Metis - A fast and high quality multilevel scheme for partitioning irregular
graphs. Karypis and Kumar, SIAM J. on Scientific Computing, 1998.
4. Normalized Cuts and Image Segmentation. Shi and Malik, IEEE Trans. PAMI,
2000.
5. Finding and evaluating community structure in networks. Newman and Girvan,
Phys. Rev. E 69, 2004.
6. The identification of functional modules from the genomic association of genes.
Snel et al., PNAS, 2002.
Thank You!