Jure Leskovec, CMU
Kevin Lang, Anirban Dasgupta, Michael Mahoney, Yahoo! Research

Big data
• Study emerging behaviors
• How are small networks different from large ones?

Communities (groups, clusters, modules): sets of nodes with lots of connections inside and few to the outside (the rest of the network).

Protein networks
• Nodes represent proteins
• Edges represent interactions/associations
• Proteins with the same function interact more
• We can use the network to discover functional groups
(Figure: yeast transcriptional regulatory modules [Bar-Joseph et al., 2003])

Social networks
Clusters correspond to social communities and organizational units (e.g., departments).
Zachary’s Karate club network
• During the study the club split into 2
• The split corresponds to the min-cut (● vs. ■)

Democrat vs. Republican blogs [Adamic-Glance 2005]

Citations and collaborations [Newman 2003]

Nested communities: the modular structure of networks is hierarchically organized.
(Figure: example hierarchy, top to bottom: University; Arts, Science; CS, Math, Drama, Music)

Recursive hierarchical network
(a) N=5, E=8
(b) N=25, E=56
(c) N=125, E=344

Intuition: find nodes that can be easily separated from the rest of the network.
Various objective functions
• Min-cut
• Normalized-cut
• Centrality, Modularity
Various algorithms
• Spectral clustering (random walks)
• Girvan-Newman (centrality)
• Metis (contraction based)
Girvan-Newman:
1) Betweenness centrality: number of shortest paths passing through an edge.
2) Remove edges by decreasing centrality.

Statistical properties of community structure
Instead of searching for communities, we measure how well expressed communities are.
Questions
• What is the community structure of real-world networks?
• How do we measure and quantify this?
• What does this tell us about network structure?
• What is a good model (intuition)?
• What are the consequences for clustering/partitioning algorithms?

How community-like is a set of nodes?
We need a natural, intuitive measure: conductance (normalized cut)
Φ(S) = # edges cut / # edges inside
Small Φ(S) corresponds to more community-like sets of nodes.
(Figure: a set of nodes S and its complement S’)

What is the “best” community of 5 nodes?
Score: Φ(S) = # edges cut / # edges inside
• Bad community: Φ = 5/7 = 0.7
• Better community: Φ = 2/5 = 0.4
• Best community: Φ = 2/8 = 0.25

We define the network community profile (NCP) plot: plot the score of the best community of size k.
• Search over all subsets of size k and find the best: Φ(k=5) = 0.25
• The NCP plot is intractable to compute
(Figure: NCP plot. x-axis: community size, log k. y-axis: community score, log Φ(k). Example points: k=5, Φ(k)=0.25; k=7, Φ(k)=0.18)

Local spectral clustering algorithm
• Pick a seed node
• Slowly diffuse mass around it (via a PageRank-like random walk)
• Find the bottleneck
• Repeat many times: many seed nodes for very local walks, fewer seed nodes for more global (longer) walks

Dolphin social network: two communities of dolphins
(Figure: network and NCP plot)

Zachary’s university karate club social network
• During the study the club split into 2
• The split (squares vs. circles) corresponds to cut B
(Figure: network and NCP plot)

Collaborations between scientists in Networks
(Figure: network and NCP plot)

Manifold learning dataset (Hands)
(Figure: network and NCP plot)

Eastern US power grid
(Figure: network and NCP plot)

Small social networks, geometric networks, and hierarchical networks have a downward NCP plot.
What about large networks?
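The score and the search over all subsets of size k described above can be sketched in a few lines of Python. This is a minimal illustration, assuming an undirected graph given as a list of edge pairs; the function names `score` and `ncp_brute_force` and the toy graph are hypothetical and not taken from the slides. It uses the slides' simplified score (edges cut / edges inside); the standard conductance normalizes by the volume of S instead. The search enumerates every subset of size k, which is why the exact NCP plot is intractable for networks of realistic size.

```python
from itertools import combinations


def score(edges, S):
    """Slide-style community score: (# edges cut) / (# edges inside S)."""
    S = set(S)
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    # A set with no internal edges is treated as infinitely bad (assumption).
    return cut / inside if inside else float("inf")


def ncp_brute_force(edges, k):
    """Best (smallest) score over ALL node subsets of size k.

    Exponential in k: only usable on toy graphs.
    """
    nodes = sorted({x for e in edges for x in e})
    return min(score(edges, S) for S in combinations(nodes, k))


# Toy graph (hypothetical): two triangles joined by a single edge.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

one_triangle = score(toy_edges, {0, 1, 2})   # 1 edge cut, 3 edges inside
best_of_size_3 = ncp_brute_force(toy_edges, 3)
```

On this toy graph each triangle cuts one edge and keeps three inside, so either triangle is the best community of size 3. Repeating `ncp_brute_force` for every k gives the points of the NCP plot.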
Previously, researchers examined the community structure of small networks (~100 nodes).
We examined more than 70 different large networks.
Large real-world networks look very different!

Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)

(Figure: NCP plot. x-axis: community size. y-axis: community score)
• First, better and better communities
• The best community has 100 nodes
• Beyond that, the best communities get worse and worse

Whiskers are responsible for the downward slope of the NCP plot.
A whisker is a set of nodes connected to the network by a single edge.
(Figure: NCP plot with the largest whisker marked)

Each new edge inside the community costs more.
• Φ = 1/3 = 0.33
• Φ = 2/4 = 0.5
• Φ = 8/6 = 1.3
• Φ = 64/14 = 4.6
Each node has twice as many children.

Take a real network G.
Rewire edges for a long time.
We obtain a random graph with the same degree distribution as the real network G.

Rewired network: random network with the same degree distribution

Whiskers in real networks are larger than expected.

Whiskers in real networks are non-trivial (richer than trees).
(Figure: whisker with the edge to cut marked)

What if we allow cuts that give disconnected communities?
• Cut all whiskers
• Compose communities out of whiskers
• How good are the “communities” we get?

We get better community scores when composing disconnected sets of whiskers.
(Figure: NCP plot. x-axis: community size. y-axis: community score. Curves: connected communities, bag of whiskers)

Nothing happens! Now we have 2-edge connected whiskers to deal with.
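The rewiring step described above (take a real network G, rewire edges for a long time, obtain a random graph with the same degree distribution) can be sketched as repeated double-edge swaps. This is a minimal sketch, assuming a simple undirected graph stored as a list of edge pairs; the names `rewire` and `degrees`, the swap count, and the example graph are illustrative assumptions, and the slides do not say which rewiring procedure was actually used.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def rewire(edges, n_swaps, seed=0):
    """Degree-preserving randomization by double-edge swaps.

    Picks two edges (a, b) and (c, d) and replaces them with (a, d) and
    (c, b), rejecting swaps that would create a self-loop or a duplicate
    edge. Every node keeps its original degree.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    max_attempts = 100 * n_swaps  # give up eventually on tiny graphs
    while done < n_swaps and attempts < max_attempts:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # randomize the orientation of the second edge
            c, d = d, c
        if a == d or c == b:  # would create a self-loop
            continue
        if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
            continue  # would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {frozenset((a, d)), frozenset((c, b))}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list


# Hypothetical example: a 10-node ring with two chords.
ring = [(i, (i + 1) % 10) for i in range(10)] + [(0, 5), (2, 7)]
rewired = rewire(ring, n_swaps=50, seed=1)
```

Comparing whisker sizes or the NCP plot of `rewired` against those of the original network is the kind of comparison the slides make when they say whiskers in real networks are larger than expected.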
(Figure: NCP plot. Curves: rewired network, connected communities, bag of whiskers)

Network structure: core-periphery (jellyfish, octopus)
• Whiskers are responsible for good communities
• Denser and denser core of the network
• The core contains 60% of the nodes and 80% of the edges

(Sparse) random graph:
• Start with N nodes
• Pick pairs of nodes uniformly at random and connect them
• Theorem (works for any degree distribution)
NCP plot: flat (long random connections)
Sparsity does not explain our observation.

Preferential attachment [Price 1965, Albert & Barabasi 1999]:
• Add a new node, create m out-links
• The probability of linking to a node is proportional to its degree ki
• Based on Herbert Simon’s result
• Power-laws arise from “rich get richer” (cumulative advantage)
NCP plot: flat (connections to hubs, no locality)

Let’s exploit local connections.
NCP plot: down (locally the network looks like a mesh) and flat (at large scale the network looks random)

Geometric preferential attachment:
• Place nodes at random in 2D
• Pick a node
• Pick nodes in a radius
• Connect preferentially
NCP plot: flat (locally the network is random) and down (globally the network is a mesh, a union of local expanders)

Forest Fire: connections spread like a fire
• A new node joins the network
• It selects a seed node
• It connects to some of the seed's neighbors
• Continue recursively
As a community grows, it blends into the core of the network.
(Figure: NCP plot. Curves: rewired network, bag of whiskers)

Whiskers:
• The largest whisker has ~100 nodes
• Independent of network size
• Dunbar number: a person can maintain social relationships with at most 150 people
Core:
• The core has little structure (hard to cut)
• Still more structure than the random network

Other researchers examined small networks, so they did not hit Dunbar’s limit.
Small evidence:
• 400k-node Amazon co-purchasing network [Clauset et al. 2004]
▪ Largest community has 50% of all nodes
▪ It was labeled “Miscellaneous”
• Karate club has no significant community structure [Newman et al. 2007]

Bond vs. identity communities
Multiple hierarchies that blur the community boundaries

Ground truth
Yes, use attributes, better link semantics

The NCP plot is a way to analyze network community structure.
Our results agree with previous work on small networks (which are commonly used for testing community-finding algorithms).
But large networks are different.
Large networks: whiskers + core structure
Small, well-isolated communities blend into the core of the network as they grow.
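The whiskers + core picture can be probed directly by looking for whiskers, which the slides define as sets of nodes connected to the network by a single edge. The sketch below is a brute-force illustration, assuming a small connected undirected graph given as an edge list; the names `find_whiskers` and `reachable` and the example graph are hypothetical. For each edge it checks whether removing that edge disconnects the graph, and reports the smaller side as a whisker. It costs one graph traversal per edge, so a linear-time bridge-finding algorithm would be preferable on large networks.

```python
from collections import defaultdict, deque


def reachable(adj, start, banned):
    """Nodes reachable from `start` by BFS without crossing the `banned` edge."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if {u, v} == banned or v in seen:
                continue
            seen.add(v)
            queue.append(v)
    return seen


def find_whiskers(edges):
    """Return (edge_to_cut, whisker_nodes) pairs, assuming a connected graph."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    all_nodes = set(adj)
    found = []
    for u, v in edges:
        side = reachable(adj, u, {u, v})
        if v not in side:  # removing (u, v) disconnects the graph
            other = all_nodes - side
            found.append(((u, v), min(side, other, key=len)))
    return found


# Hypothetical example: a 4-clique "core", a triangle hanging off node 3
# by a single edge, and a single pendant node attached to node 0.
example_edges = [
    (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),  # core
    (3, 4), (4, 5), (5, 6), (4, 6),                  # triangle whisker
    (0, 7),                                          # pendant whisker
]
whisker_list = find_whiskers(example_edges)
largest_whisker = max((nodes for _, nodes in whisker_list), key=len)
```

The triangle {4, 5, 6} is an example of a whisker that is richer than a tree: it hangs off the core by the single edge (3, 4) but contains a cycle.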