PPT - L. Karl Branting

Overcoming Resolution Limits in MDL Community Detection
L. Karl Branting
The MITRE Corporation
2nd SNA-KDD Workshop 24 Aug 2008
Outline



Utility functions in community detection
Resolution limits
MDL-based community detection
– Previous: RB and AP
– New: SGE


Experimental Evaluation
Lessons
Utility functions in community detection

Two components of community detection algorithms
– Utility function – quality criterion to be optimized
– Search strategy – procedure for finding optimal partition

Examples
– Girvan & Newman (2003)

Utility function: modularity

Search strategy: greedy divisive hierarchical clustering (iteratively
remove highest betweenness edge)
– Newman (2003)

Utility function: modularity

Search strategy: greedy agglomerative hierarchical clustering
(iteratively choose highest modularity merge)
– Tasgin & Bingol (2006)

Utility function: modularity

Search strategy: genetic algorithm
Utility functions in community detection

Other search strategies used with modularity
– Rattigan, Maier, Jensen (2007)

Utility function: modularity

Search strategy: greedy divisive hierarchical clustering using a
Network Structure Index to approximate edge betweenness
– Donetti & Munoz (2004)

Utility function: modularity

Search strategy: greedy agglomerative hierarchical clustering with
spectral division
Utility functions in community detection

Statistical Approaches
– Zhang, Qiu, Giles, Foley, & Yen (2007)


Utility function: log-likelihood (LDA parameters)

Search strategy: fixed-point iteration
Compression-Based Approaches
– Rosvall & Bergstrom (2007)

Utility function: Minimum Description Length

Search strategy: simulated annealing
– Chakrabarti (2004)


Utility function: Minimum Description Length

Search strategy: exhaustive search for k, hill-climbing given k
Utility function implicit in search strategy
– Raghavan, Albert, & Kumara (2007) – marker passing
– Cliques, cores, etc.
Modularity

\[ Q = \sum_{i=1}^{m} \left[ \frac{w(D_{ii})}{l} - \left( \frac{l_i}{l} \right)^{2} \right] \]

– w(D_ii) = number of edges internal to group i
– l_i = number of edges incident to vertices in group i
– l = total number of edges



Intuitive – expresses intuition that ratio of
internal to external edges is greater for
groups than for non-groups
Popular
Imperfect
– Fortunato & Barthelemy (2007) Resolution
limit: groups conflated if number of vertices
less than 2l
– Rosvall & Bergstrom (2007) Biased towards
same-sized groups
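
As a minimal sketch of how a modularity score can be computed for a given partition, the Python below uses the standard degree-based form, in which the second term for each group is (d_i / 2l)^2 with d_i the sum of vertex degrees in the group. That normalization is an assumption and may differ from the notation on the slide. The function name `modularity` and the toy graph (two triangles joined by one edge) are illustrative only.

```python
from collections import defaultdict


def modularity(edges, group_of):
    """Degree-based modularity of a partition of an undirected simple graph.

    edges: list of (u, v) pairs; group_of: dict mapping vertex -> group label.
    Assumption: second term uses (degree sum / 2l)^2, the standard form.
    """
    l = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in the group
    degree_sum = defaultdict(int)  # sum of vertex degrees in the group
    for u, v in edges:
        degree_sum[group_of[u]] += 1
        degree_sum[group_of[v]] += 1
        if group_of[u] == group_of[v]:
            internal[group_of[u]] += 1
    return sum(
        internal[g] / l - (degree_sum[g] / (2 * l)) ** 2 for g in degree_sum
    )


# Toy graph: two triangles joined by a single edge (illustrative data).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {v: "all" for v in range(6)}

q_two = modularity(edges, two_groups)  # one group per triangle
q_one = modularity(edges, one_group)   # everything in a single group
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping all vertices together, which is the intuition the utility function is meant to capture.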
Resolution Limit

Ring graph R15,4
– 15 communities
– 4 nodes per community

Community structure that maximizes modularity conflates groups
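
The conflation can be checked with a short calculation. The sketch below assumes each community is a clique of c nodes and that adjacent communities are joined by a single edge. The slides do not spell out the internal structure of a community, so numbers computed this way need not match the deck's experimental thresholds. It also assumes the standard degree-based modularity, and the function name `ring_modularity` is hypothetical.

```python
def ring_modularity(c, run_lengths):
    """Modularity of a ring of cliques when consecutive cliques are grouped.

    c: nodes per clique (assumed community structure).
    run_lengths: number of consecutive cliques in each group; the values
    must sum to the number of cliques m, and each run must be shorter than m.
    """
    m = sum(run_lengths)
    clique_edges = c * (c - 1) // 2
    l = m * clique_edges + m  # edges inside the m cliques plus m ring edges
    q = 0.0
    for k in run_lengths:
        internal = k * clique_edges + (k - 1)    # ring edges inside the run
        degree_sum = k * (2 * clique_edges + 2)  # each clique touches 2 ring edges
        q += internal / l - (degree_sum / (2 * l)) ** 2
    return q


# R15,4: every community separate vs. adjacent communities merged in pairs
# (15 is odd, so one community stays on its own).
q_separate = ring_modularity(4, [1] * 15)
q_paired = ring_modularity(4, [2] * 7 + [1])
print(q_separate, q_paired)
```

Under these assumptions the paired partition scores slightly higher than the partition with 15 separate communities, so a modularity maximizer prefers the conflated structure.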
Approaches to modularity’s resolution limit

Apply recursively to large communities (Ruan & Zhang 2007)
Apply locally (Clauset 2005)
Choose a different utility function
Description Length

Utility of community structure is sum of bits needed to represent
– Community structure +
– Graph given community structure

Search strategy attempts to minimize description length
There is no unique bit count
– Kolmogorov complexity is uncomputable

Previous approaches
– Rosvall & Bergstrom (2007): RB

Handles group size skew better than modularity
– Chakrabarti (2004): AP
– Comparison

Similar breakdown of bits

Different calculation
Components of Description

Components (details in paper)
1. Bits to represent number of nodes in graph
   ignored because not specific to community structure
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes
Purpose
– 2, 3, 4: represent group structure
– 1, 5: represent graph as a whole
Surprising Experimental Result

RB, AP, and modularity compared as utility functions
– Applied to ring graphs Rm,c for 4 ≤ m ≤ 16 and 3 ≤ c ≤ 9
– Search strategy: greedy divisive hierarchical clustering (iteratively remove highest betweenness edge)

Unsurprising result: modularity led to conflated groups for
– m > 8 and c = 3
– m > 10 and c = 4
– m > 11 and c = 5
– m > 13 and c = 6, 7
Surprising result
– Both RB and AP conflated at least one pair of groups in every Rm,c!
Hypothesis


Both RB and AP require at least one bit per pair of groups in term 4
Perhaps this estimation causes group conflation
– Term 4 grows as the square of the number of groups
– If graph is sparse, conflating groups may save more in term 4 reduction than it costs in term 5 increase
Components
1. Bits to represent number of nodes in graph
   ignored because not specific to community structure
2. Bits to represent number of groups
3. Bits to represent mapping between nodes and groups
4. Bits needed for number of group-to-group edges
5. Bits needed for adjacencies between nodes
SGE (Sparse Graph Encoding)

Components
1. Bits to represent number of nodes in graph

Ignored, as in RB and AP
2. Bits to represent number of groups

Follows RB
3. Bits to represent mapping between nodes and groups

Similar to AP
4. Bits needed for number of group-to-group edges
   Split into 2 terms
   - Which pairs of groups are connected (much less than one bit per pair if pairs are sparsely or densely connected)
   - Number of edges between connected groups
   Grows as number of connected pairs, not total number of pairs
5. Bits needed for adjacencies between nodes

Follows RB
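
To illustrate why naming the connected pairs can cost much less than one bit per pair, the sketch below compares a one-bit-per-pair lower bound with an enumerative code that states how many pairs are connected and then which ones. The enumerative code is an assumption chosen for illustration, not the SGE formula, which is given in the paper. Both function names are hypothetical.

```python
from math import comb, log2


def bits_one_per_pair(num_groups):
    """Lower bound when every unordered pair of distinct groups costs one bit."""
    return num_groups * (num_groups - 1) // 2


def bits_connected_subset(num_groups, num_connected_pairs):
    """Bits to say which pairs of groups are connected (illustrative code).

    First state how many pairs are connected, then pick that subset out of
    all pairs: log2(pairs + 1) + log2(C(pairs, num_connected_pairs)).
    """
    pairs = num_groups * (num_groups - 1) // 2
    return log2(pairs + 1) + log2(comb(pairs, num_connected_pairs))


# A ring of 15 groups has 105 group pairs, but only 15 of them are connected.
dense_bound = bits_one_per_pair(15)
sparse_cost = bits_connected_subset(15, 15)
print(dense_bound, sparse_cost)
```

The subset code is also cheap when almost every pair is connected, which matches the "sparsely or densely connected" remark in the list above.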
Performance of SGE on Ring Graphs

Correct community structure found for every Rm,c for 4 ≤
m ≤ 16 and 3 ≤ c ≤ 9 except
– R4,3
– R13,3


Results confirm hypothesis that resolution limit in RB and
AP is result of over-counting term 4: the bits needed for
group-to-group edges
Significance
– Ring graphs rare in real world
– How does SGE compare on more realistic graphs?
Uniform random graph


Similar to graphs in Rosvall & Bergstrom (2007)
Test set
– 32 vertices
– 4 groups
– average degree 6
– size ratio {1.0, 1.25, 1.5, 1.75, 2.0}
– Proportion internal edges {0.6, 0.75, 0.9}

Example:
– 32 vertices
– 4 groups
– average degree 6
– size ratio 1.25
– Proportion internal edges 0.67
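
A test graph of this kind could be generated roughly as follows. This is a sketch under stated assumptions, not the generator used in the experiments: the function name `planted_partition_graph` is hypothetical, group sizes are passed explicitly because the slide does not define how the size ratio maps to sizes, and internal and external edges are each drawn uniformly without replacement.

```python
import random
from itertools import combinations


def planted_partition_graph(group_sizes, avg_degree, frac_internal, seed=0):
    """Random graph with a fixed share of edges inside the planted groups.

    Returns (group_of, edges): group_of[v] is the group index of vertex v,
    edges is a list of (u, v) pairs with u < v.
    """
    rng = random.Random(seed)
    group_of = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    n = len(group_of)
    total_edges = round(n * avg_degree / 2)
    n_internal = round(total_edges * frac_internal)

    internal_pairs, external_pairs = [], []
    for u, v in combinations(range(n), 2):
        same = group_of[u] == group_of[v]
        (internal_pairs if same else external_pairs).append((u, v))

    # Sample each kind of edge uniformly without replacement.
    edges = rng.sample(internal_pairs, n_internal)
    edges += rng.sample(external_pairs, total_edges - n_internal)
    return group_of, edges


# 32 vertices, 4 equal groups, average degree 6, 75% internal edges.
group_of, edges = planted_partition_graph([8, 8, 8, 8], 6, 0.75, seed=1)
n_inside = sum(group_of[u] == group_of[v] for u, v in edges)
print(len(edges), n_inside)
```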
Embedded Barabasi-Albert Graphs

Test set
– 4 communities separately generated by preferential attachment
– In each community
  4 initial vertices
  2-4 edges added per time step
  20 time steps

Example
– 4 communities
– 4 initial vertices
– 3 edges added per time step
– 20 time steps
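
One community of this kind might be grown as in the sketch below. Several details are assumptions not stated on the slide: the initial vertices start as a clique, each time step adds one new vertex, and the new vertex attaches to distinct existing vertices chosen with probability proportional to degree. How the four communities are then linked to each other is not covered here. The function name is hypothetical.

```python
import random


def preferential_attachment_community(initial=4, edges_per_step=3, steps=20, seed=0):
    """Grow one community by degree-proportional attachment (illustrative)."""
    if edges_per_step > initial:
        raise ValueError("edges_per_step must not exceed the initial vertex count")
    rng = random.Random(seed)
    # Assumption: the initial vertices form a clique.
    edges = [(u, v) for u in range(initial) for v in range(u + 1, initial)]
    # Each vertex appears in this list once per incident edge, so a uniform
    # draw from it selects a vertex with probability proportional to degree.
    endpoints = [x for edge in edges for x in edge]
    for step in range(steps):
        new_vertex = initial + step
        targets = set()
        while len(targets) < edges_per_step:
            targets.add(rng.choice(endpoints))
        for t in sorted(targets):
            edges.append((t, new_vertex))
            endpoints.extend((t, new_vertex))
    return edges


community_edges = preferential_attachment_community(4, 3, 20, seed=1)
community_vertices = {x for edge in community_edges for x in edge}
print(len(community_vertices), len(community_edges))
```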
Evaluation Criteria

Rand index (Rand 1971)
Adjusted Rand index (Hubert & Arabie 1985)

F-measure – based on same-cluster pairs
– Recall = |proposedPairs ∩ actualPairs| / |actualPairs|
– Precision = |proposedPairs ∩ actualPairs| / |proposedPairs|
– F-measure = 2 * recall * precision / (recall + precision)
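
The pair-based F-measure can be sketched in a few lines of Python. Here a clustering is represented as a list of labels, one per vertex, and a "pair" is any two vertices that share a label. The function names and the toy labels are illustrative.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Set of vertex pairs (u, v), u < v, that share a cluster label."""
    return {
        (u, v)
        for u, v in combinations(range(len(labels)), 2)
        if labels[u] == labels[v]
    }


def pair_f_measure(proposed, actual):
    """Return (recall, precision, f_measure) over same-cluster pairs."""
    proposed_pairs = same_cluster_pairs(proposed)
    actual_pairs = same_cluster_pairs(actual)
    common = len(proposed_pairs & actual_pairs)
    recall = common / len(actual_pairs) if actual_pairs else 0.0
    precision = common / len(proposed_pairs) if proposed_pairs else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    return recall, precision, 2 * recall * precision / (recall + precision)


# Two true groups of three vertices, conflated into one proposed group.
actual = [0, 0, 0, 1, 1, 1]
proposed = [0, 0, 0, 0, 0, 0]
recall, precision, f_measure = pair_f_measure(proposed, actual)
print(recall, precision, f_measure)
```

Conflating the two groups keeps recall at 1 but lowers precision, because every cross-group pair is wrongly proposed as a same-cluster pair.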
Results: Uniform random graph (three slides of plots)
Results: Embedded Barabasi-Albert (plot)
Summary of Evaluation

Random graphs
– Community structure is weak
  Group sizes are balanced – modularity is best
  Group sizes are imbalanced – RB is best (as per Rosvall & Bergstrom 2007)
– Community structure is strong
  Group sizes are balanced – not much difference
  Group sizes are imbalanced – modularity is particularly bad (as per Rosvall & Bergstrom 2007), SGE slightly better than RB and AP
EBA graphs
– Sparse – AP and SGE weaker than modularity and RB
– Dense – essentially identical accuracy
Conclusion

Narrow
– Conflation of groups by MDL in sparse graphs (e.g., ring
graphs) can be avoided by adjusting group-to-group edge
counts.
– This change doesn’t hurt performance in more common types
of graphs.
– Compression-based clustering works well, but requires
tinkering
– Modularity detects weak structure well when graph not too big
and groups not too imbalanced

Broad
– Still unclear what utility function is best overall
– Needed: theory relating graph typology to utility functions