Statistical properties of network community structure

Jure Leskovec, CMU
Kevin Lang, Anirban Dasgupta, Michael Mahoney
Yahoo! Research



Big data
Study emerging behaviors
How are small networks different from large ones?

Communities (groups, clusters, modules):
• Sets of nodes with lots of connections inside and few to outside (the rest of the network)


Nodes represent proteins
Edges represent interactions/associations

Proteins with the same function interact more

Can use the network to discover functional groups
Yeast transcriptional regulatory modules [Bar-Joseph et al., 2003]

Clusters correspond to social communities and organizational units (e.g., departments)
Zachary’s Karate club network
• During the study the club split into 2
• The split corresponds to the min-cut (● vs. ■)

Democrat vs. Republican blogs [Adamic-Glance 2005]

Citations
Collaborations
[Newman 2003]

Nested communities: the modular structure of networks is hierarchically organized
Figure labels (hierarchy): University; Arts, Science; CS, Math, Drama, Music

Recursive hierarchical network
(a) N=5, E=8
(b) N=25, E=56
(c) N=125, E=344
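
The slide gives only the node and edge counts, not the construction. One construction that reproduces those counts (an assumption, in the style of replicate-and-reconnect hierarchical models) starts from a hub joined to a ring of four peripheral nodes, then at each step makes four copies of the current network and links every peripheral node of each copy to the original hub. The function name `hierarchical_network` is hypothetical.

```python
def hierarchical_network(levels):
    """Return (node_count, edge_set) after `levels` replication steps.

    Assumed construction: it matches the N and E values on the slide,
    but the slide itself does not state how the network is built.
    """
    # Level 0: hub 0 joined to four peripheral nodes that form a ring.
    edges = {(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (2, 3), (3, 4), (1, 4)}
    n, periphery = 5, [1, 2, 3, 4]
    for _ in range(levels):
        new_edges, new_periphery = set(edges), []
        for copy in range(1, 5):
            offset = copy * n
            # Replicate the current network with shifted node labels.
            new_edges |= {(u + offset, v + offset) for u, v in edges}
            # Link the copy's peripheral nodes back to the original hub.
            for p in periphery:
                new_edges.add((0, p + offset))
                new_periphery.append(p + offset)
        edges, periphery, n = new_edges, new_periphery, 5 * n
    return n, edges


sizes = []
for level in range(3):
    n, edges = hierarchical_network(level)
    sizes.append((n, len(edges)))
print(sizes)  # [(5, 8), (25, 56), (125, 344)]
```

Under this assumption the counts follow N' = 5N and E' = 5E + (number of peripheral nodes in the four copies), which gives 8, 56, 344.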

Intuition: Find nodes that can be easily separated from the rest of the network

Various objective functions
• Min-cut
• Normalized-cut
• Centrality, Modularity

Various algorithms
• Spectral clustering (random walks)
• Girvan-Newman (centrality)
• Metis (contraction based)

Girvan-Newman:
1) Betweenness centrality: number of shortest paths passing through an edge.
2) Remove edges by decreasing centrality
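
The two Girvan-Newman steps can be sketched in plain Python. This is a minimal illustration, assuming an unweighted, undirected graph without self-loops stored as a dict of neighbor sets; it recomputes betweenness after every removal, as in the usual formulation of the algorithm, which the slide does not spell out. All function names are hypothetical.

```python
from collections import deque


def edge_betweenness(adj):
    """Shortest-path edge betweenness via BFS from every source."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma = {s: 0}, {s: 1.0}
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    sigma[w] = 0.0
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]      # count shortest paths
                    preds[w].append(v)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):             # accumulate dependencies
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    # Every unordered pair was counted from both endpoints.
    return {e: b / 2 for e, b in bet.items()}


def components(adj):
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        seen.add(s)
        while stack:
            v = stack.pop()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def girvan_newman_split(adj):
    """Remove highest-betweenness edges until the graph gains a component."""
    adj = {v: set(ns) for v, ns in adj.items()}
    start = len(components(adj))
    while len(components(adj)) == start:
        bet = edge_betweenness(adj)
        if not bet:
            break
        u, v = max(bet, key=bet.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
parts = girvan_newman_split(adj)
```

On the toy graph the bridge 2-3 carries all nine cross-triangle shortest paths, so it is removed first and the two triangles come apart.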
Statistical properties of community structure
• Instead of searching for communities, we measure how well expressed the communities are
Questions
• What is the community structure of real-world networks?
• How to measure and quantify this?
• What does this tell us about network structure?
• What is a good model (intuition)?
• What are the consequences for clustering/partitioning algorithms?

How community-like is a set of nodes?

Need a natural intuitive measure

Conductance (normalized cut)
Φ(S) = # edges cut / # edges inside

Small Φ(S) corresponds to more community-like sets of nodes
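
A minimal sketch of the score, assuming an undirected graph given as a list of edges. `community_score` follows the slide's formula literally (edges cut over edges inside). `conductance` is the common textbook form, cut edges divided by the smaller of the two volumes (sums of degrees); it is included for comparison and is an addition, not something the slide states. Both function names are hypothetical.

```python
def community_score(edges, S):
    """Slide formula: (# edges cut) / (# edges inside S)."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    return cut / inside if inside else float("inf")


def conductance(edges, S):
    """Textbook form: cut / min(vol(S), vol(rest)), vol = sum of degrees."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    vol_S = sum((u in S) + (v in S) for u, v in edges)
    vol_rest = 2 * len(edges) - vol_S
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else float("inf")


# Two triangles joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
score = community_score(edges, {0, 1, 2})  # 1 cut edge / 3 inside edges
phi = conductance(edges, {0, 1, 2})        # 1 cut edge / volume 7
```

Both versions rank sets the same way in spirit: fewer edges leaving the set and more edges within it give a smaller value.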
What is the “best” community of 5 nodes?
Score: Φ(S) = # edges cut / # edges inside
Bad community: Φ=5/6 = 0.83
What is the “best” community of 5 nodes?
Score: Φ(S) = # edges cut / # edges inside
Bad community: Φ=5/7 = 0.7
Better community: Φ=2/5 = 0.4
Best community: Φ=2/8 = 0.25

We define:
Network community profile (NCP) plot
Plot the score of the best community of size k

Search over all subsets of size k and find the best: Φ(k=5) = 0.25
NCP plot is intractable to compute
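
To make the definition concrete, here is a brute-force sketch that enumerates every subset of size k and keeps the best score. It uses the slide's score (edges cut over edges inside) and is only usable on toy graphs, since the number of subsets grows combinatorially, which is the intractability noted above. The names `score` and `ncp_brute_force` are hypothetical.

```python
from itertools import combinations


def score(edges, S):
    """Slide formula: (# edges cut) / (# edges inside S)."""
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    return cut / inside if inside else float("inf")


def ncp_brute_force(nodes, edges, max_k):
    """Map each size k to the best (smallest) score over all k-subsets."""
    nodes = list(nodes)
    return {
        k: min(score(edges, set(S)) for S in combinations(nodes, k))
        for k in range(1, max_k + 1)
    }


# Two triangles joined by the single edge 2-3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
profile = ncp_brute_force(range(6), edges, 3)
```

On this toy graph the profile drops at k=3, where a whole triangle can be cut off with a single edge.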

Network community profile (NCP) plot
Axes: community size, log k (horizontal) vs. community score, log Φ(k) (vertical)
Example points: k=5, Φ(k)=0.25; k=7, Φ(k)=0.18

Local spectral clustering algorithm
• Pick a seed node
• Slowly diffuse mass around it (via a PageRank-like random walk)
• Find the bottleneck
Repeat many times
• Many seed nodes for very local walks
• Fewer seed nodes for more global (longer) walks
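
The seed, diffuse, find-the-bottleneck loop can be sketched as personalized PageRank followed by a sweep cut. This is an illustration under assumptions: it uses plain power iteration rather than whatever approximation the authors used, a teleport probability `alpha=0.15` chosen arbitrarily, an undirected graph with no isolated nodes, and the textbook conductance (cut over smaller volume) for the sweep. All names are hypothetical.

```python
def personalized_pagerank(adj, seed, alpha=0.15, iters=200):
    """Random walk that restarts at `seed` with probability alpha."""
    p = {v: 0.0 for v in adj}
    p[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        nxt[seed] += alpha                      # restart mass
        for v, mass in p.items():
            share = (1 - alpha) * mass / len(adj[v])
            for w in adj[v]:
                nxt[w] += share                 # diffuse to neighbors
        p = nxt
    return p


def sweep_cut(adj, p):
    """Scan prefixes of nodes sorted by p/degree; keep the best bottleneck."""
    order = sorted((v for v in adj if p[v] > 0),
                   key=lambda v: p[v] / len(adj[v]), reverse=True)
    total_vol = sum(len(ns) for ns in adj.values())
    S, vol, cut = set(), 0, 0
    best, best_phi = None, float("inf")
    for v in order:
        inside = sum(1 for w in adj[v] if w in S)
        S.add(v)
        vol += len(adj[v])
        cut += len(adj[v]) - 2 * inside         # update cut incrementally
        denom = min(vol, total_vol - vol)
        if denom == 0:
            continue
        phi = cut / denom
        if phi < best_phi:
            best, best_phi = set(S), phi
    return best, best_phi


def local_cluster(adj, seed, alpha=0.15):
    return sweep_cut(adj, personalized_pagerank(adj, seed, alpha))


# Two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
best, phi = local_cluster(adj, seed=0)
```

A larger `alpha` keeps the walk closer to the seed (more local clusters); a smaller one lets mass spread further before the sweep.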



Dolphin social network
• Two communities of dolphins
Figure panels: Network; NCP plot

Zachary’s university karate club social network
• During the study the club split into 2
• The split (squares vs. circles) corresponds to cut B
Figure panels: Network; NCP plot

Collaborations between scientists working on networks
Figure panels: Network; NCP plot

Manifold learning dataset (Hands)
Figure panels: Network; NCP plot

Eastern US power grid
Figure panels: Network; NCP plot

Small social networks, geometric networks, and hierarchical networks have a downward-sloping NCP plot
What about large networks?

Previously, researchers examined the community structure of small networks (~100 nodes)

We examined more than 70 different large networks
Large real-world networks look very different!

Typical example:
General relativity collaboration network
(4,158 nodes, 13,422 edges)
NCP plot (community score vs. community size), annotations:
• Better and better communities
• Best community has 100 nodes
• Best communities get worse and worse

Whiskers are responsible for the downward slope of the NCP plot
Figure labels: NCP plot; largest whisker
A whisker is a set of nodes connected to the network by a single edge
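
The whisker definition translates directly into a search for edges whose removal disconnects the graph. The sketch below is a simple quadratic illustration, assuming a connected, undirected graph with orderable node labels stored as a dict of neighbor sets; it treats the smaller side of each such edge as a candidate and keeps only the maximal ones. The function names are hypothetical.

```python
def reachable(adj, start, banned):
    """Nodes reachable from `start` without crossing the `banned` edge."""
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in adj[v]:
            if {v, w} == banned or w in seen:
                continue
            seen.add(w)
            stack.append(w)
    return seen


def whiskers(adj):
    """Maximal node sets attached to the rest of the graph by one edge."""
    n, found = len(adj), []
    for u in adj:
        for v in adj[u]:
            if u < v:                                # visit each edge once
                side = reachable(adj, u, {u, v})
                if v not in side:                    # removing u-v disconnects
                    small = side if 2 * len(side) <= n else set(adj) - side
                    found.append(frozenset(small))
    # Drop candidates nested inside a larger whisker.
    return [w for w in found if not any(w < other for other in found)]


# Core 0-1-2-3 with a three-node whisker {4, 5, 6} and a one-node whisker {7}.
adj = {0: {1, 2, 3}, 1: {0, 2, 7}, 2: {0, 1, 3}, 3: {0, 2, 4},
       4: {3, 5, 6}, 5: {4}, 6: {4}, 7: {1}}
result = whiskers(adj)
```

For large graphs a linear-time bridge-finding pass would replace the per-edge reachability check; the quadratic version is kept here for readability.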

Each new edge inside the community costs more
Each node has twice as many children
NCP plot: Φ=1/3 = 0.33, Φ=2/4 = 0.5, Φ=8/6 = 1.3, Φ=64/14 = 4.6



Take a real network G
Rewire edges for a long time
We obtain a random graph with the same degree distribution as the real network G

Rewired network: random network with the same degree distribution
Whiskers in real networks are larger than expected

Whiskers in real networks are non-trivial (richer than trees)
Figure label: edge to cut

What if we allow cuts that give disconnected communities?
• Cut all whiskers
• Compose communities out of whiskers
• How good are the “communities” we get?
We get better community scores when composing disconnected sets of whiskers
NCP plot (community score vs. community size); curves: connected communities, bag of whiskers

Nothing happens!
Now we have 2-edge-connected whiskers to deal with.

NCP plot curves: rewired network, connected communities, bag of whiskers
Denser and denser core of the network
Core contains 60% of the nodes and 80% of the edges
Whiskers are responsible for good communities
Network structure: Core-periphery (jellyfish, octopus)

(Sparse) Random graph:
• Start with N nodes
• Pick pairs of nodes uniformly at random and connect them
Theorem (works for any degree distribution)
NCP plot: flat (long random connections)
Sparsity does not explain our observation
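
A minimal generator for the sparse random graph described above, assuming a fixed number of edges m and no self-loops or repeated edges (the slide does not specify either). The function name `random_graph` is hypothetical.

```python
import random


def random_graph(n, m, seed=0):
    """Connect m distinct node pairs chosen uniformly at random."""
    if m > n * (n - 1) // 2:
        raise ValueError("more edges requested than distinct pairs exist")
    rng = random.Random(seed)
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)       # two distinct endpoints
        edges.add((min(u, v), max(u, v)))    # store each pair once
    return sorted(edges)


sparse = random_graph(n=20, m=30)
```

Feeding such a graph to an NCP computation is one way to check the "flat" behaviour claimed for this model.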

Preferential attachment [Price 1965, Albert & Barabasi 1999]:
• Add a new node, create m out-links
• Probability of linking to a node i is proportional to its degree ki

Based on Herbert Simon’s result
• Power-laws arise from “Rich get richer” (cumulative advantage)
NCP plot: flat (connections to hubs – no locality)
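
A short sketch of preferential attachment as described above. The starting graph (a clique on m+1 nodes) and the rule of redrawing until m distinct targets are found are assumptions made to keep the code small; the redraw makes the selection only approximately proportional to degree. The function name is hypothetical.

```python
import random


def preferential_attachment(n, m, seed=0):
    """Grow a graph where each new node links to m existing nodes,
    chosen with probability (approximately) proportional to degree."""
    rng = random.Random(seed)
    edges, pool = [], []     # pool holds each node once per incident edge
    for u in range(m + 1):   # assumed start: clique on m + 1 nodes
        for v in range(u + 1, m + 1):
            edges.append((u, v))
            pool += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(pool))    # degree-proportional draw
        for t in sorted(targets):
            edges.append((t, new))
            pool += [t, new]
    return edges


pa_edges = preferential_attachment(n=50, m=2)
```

Sampling from a list that repeats each node once per incident edge is a standard trick for degree-proportional selection without computing weights.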

Let’s exploit local connections
NCP plot: down (locally the network looks like a mesh) and flat (at large scale the network looks random)

Geometric preferential attachment:
• Place nodes at random in 2D
• Pick a node
• Pick nodes in a radius
• Connect preferentially
NCP plot: flat (locally the network is random) and down (globally the network is a mesh – union of local expanders)
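
The four steps can be sketched as follows. Details the slide leaves open are filled with assumptions: nodes arrive one at a time at uniform positions in the unit square, each new node links to at most m earlier nodes within distance `radius`, and the selection weight is degree + 1 so that isolated nodes can still be chosen. The function name and parameters are hypothetical.

```python
import math
import random


def geometric_pa(n, m, radius, seed=0):
    """Preferential attachment restricted to nodes within `radius`."""
    rng = random.Random(seed)
    pos, deg, edges = [], [], []
    for new in range(n):
        p = (rng.random(), rng.random())     # random point in 2D
        near = [v for v in range(new) if math.dist(p, pos[v]) <= radius]
        pos.append(p)
        deg.append(0)
        weights = [deg[v] + 1 for v in near] # assumed: degree + 1
        chosen = set()
        while len(chosen) < min(m, len(near)):
            chosen.add(rng.choices(near, weights=weights)[0])
        for v in sorted(chosen):
            edges.append((v, new))
            deg[v] += 1
            deg[new] += 1
    return pos, edges


pos, gpa_edges = geometric_pa(n=200, m=2, radius=0.2)
```

Every edge is short by construction, which is the locality that distinguishes this model from plain preferential attachment.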

Forest Fire: connections spread like a fire
• New node joins the network
• Selects a seed node
• Connects to some of its neighbors
• Continue recursively
As a community grows it blends into the core of the network
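
A simplified, undirected sketch of the Forest Fire steps. The assumptions here are: the seed node is chosen uniformly at random, the number of neighbors followed from each burned node is geometric with "keep going" probability `p`, and the new node links to every node the fire reaches. The function name and the parameter `p` are hypothetical.

```python
import random


def forest_fire(n, p, seed=0):
    """Grow a graph; each new node 'burns' outward from a random seed node."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        start = rng.randrange(new)           # select a seed node
        burned, frontier = {start}, [start]
        while frontier:
            v = frontier.pop()
            spread = 0
            while rng.random() < p:          # geometric number of links
                spread += 1
            fresh = [w for w in sorted(adj[v]) if w not in burned]
            rng.shuffle(fresh)
            for w in fresh[:spread]:         # some of its neighbors
                burned.add(w)
                frontier.append(w)           # continue recursively
        adj[new] = set(burned)
        for w in burned:
            adj[w].add(new)
    return adj


ff = forest_fire(n=100, p=0.35)
```

Larger values of `p` let the fire spread further, so new nodes attach to more of the existing network rather than to a small neighborhood.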
NCP plot curves: rewired network, bag of whiskers

Whiskers:
• Largest whisker has ~100 nodes
• Independent of network size
• Dunbar number: a person can maintain social relationships with at most 150 people

Core:
• Core has little structure (hard to cut)
• Still more structure than the random network


Other researchers examined small networks, so they did not hit Dunbar’s limit
Some evidence:
• 400k nodes Amazon co-purchasing network [Clauset et al. 2004]
▪ Largest community has 50% of all nodes
▪ It was labeled “Miscellaneous”
• Karate club has no significant community structure [Newman et al. 2007]


Bond vs. identity communities
Multiple hierarchies that blur the community boundaries

Ground truth
Yes, use attributes, better link semantics




NCP plot is a way to analyze network community structure
Our results agree with previous work on small networks (that are commonly used for testing community-finding algorithms)
But large networks are different
Large networks
• Whiskers + Core structure
• Small, well-isolated communities blend into the core of the network as they grow