Branching Out: Quantifying Tree-like Structure in Complex Networks

Blair D. Sullivan
Complex Systems Group
Center for Engineering Science Advanced Research
Computer Science and Mathematics Division
Oak Ridge National Laboratory
MMDS, July 12, 2012
Joint work with Michael Mahoney &
Aaron Adcock, Stanford University
A partial map of the Internet, January 15 2005
Motivation
• Large networks are becoming ubiquitous in many
domains – e.g. biology, physics, chemistry,
infrastructure, communications, and sociology
• Many methods exist to understand structure at very large scale (diameter) and
small scale (clustering coefficient); very few probe the intermediate scale
(e.g. clusters of size 5K in a 5M-node network). Can we get good tools to
understand and exploit this?
The US electric transmission system.
Courtesy North American Reliability Corporation.
Managed by UT-Battelle
for the U.S. Department of Energy
Drug-Target Network.
Nature Biotechnology 25(10), October 2007
Intermediate-Scale Structure
Ising model (ferromagnetism):
Temperature parameter controls scale of local
correlations between magnetic spins.
The “intermediate-scale structure” is the coupling of local & global properties.
• Determines network evolution & dynamics of diffusion, other processes
• Implicitly affects applicability of common data analysis tools
• This is where all the “interesting stuff” happens.
Prior empirical evidence
Claim: Many large complex networks are “tree-like” when viewed at
intermediate scales:
• The Unreasonable Effectiveness of Tree-Based Theory for Networks with
Clustering, Melnik, Hackett, Porter, Mucha, Gleeson. Physical Review E, Vol. 83,
No. 3 (2011).
• Finding Hierarchy in Directed Online Social Networks, Gupta, Shankar, Li,
Muthukrishnan, Iftode. WWW2011.
• "It was noted in recent years that the Internet structure has a highly connected
core and long stretched tendrils, and that most of the routing paths between
nodes in the tendrils pass through the core. Therefore, we suggest in this work, to
embed the Internet distance metric in a hyperbolic space where routes are bent
toward the center" Shavitt, Tankel. 2008. Hyperbolic embedding of internet graph
for distance estimation and overlay construction. IEEE/ACM Trans. Netw. 16, 1
(2008).
However, no consensus has been reached on how to define and measure this tree-like structure, making it difficult to exploit algorithmically.
Image credit: Munzer et al
What do you mean, “tree-like”?
Facebook: Caltech Network
Arxiv GR-QC collaboration
Image credit: Traub, Kelsic, Mucha, Porter
Image credit: Tim Davis
Autonomous Systems
Image credit: Graphics@Illinois
Hyperbolic Space
• Through a point not on a given line pass multiple lines parallel to that line,
and angles in a triangle sum to less than 180 degrees.
• At right, see a {7,3}-tessellation of the
hyperbolic plane by equilateral triangles,
and the dual {3,7}-tessellation by regular
heptagons. All triangles and heptagons are
of the same hyperbolic size, but the size of
their Euclidean representations
decreases exponentially as a function of the
distance from the center, while their number
increases exponentially.
• In Euclidean space, a circle’s area grows
polynomially with its diameter; in hyperbolic
space, it grows exponentially. Think of
growth as in a binary tree.
• The shortest paths in hyperbolic space are
arcs through the disk, not paths around the
exterior (much like travel in a rooted tree)
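A small numerical illustration of the growth claim, as a sketch only. It assumes curvature -1, for which the area of a hyperbolic disk of radius r is 2π(cosh r - 1), and it measures growth by radius rather than diameter. The function names are illustrative and not from the talk.

```python
import math


def euclidean_disk_area(r: float) -> float:
    """Area of a Euclidean disk of radius r (polynomial growth, ~ r^2)."""
    return math.pi * r * r


def hyperbolic_disk_area(r: float) -> float:
    """Area of a hyperbolic disk of radius r, assuming curvature -1.

    Grows like pi * e^r for large r, i.e. exponentially in the radius.
    """
    return 2.0 * math.pi * (math.cosh(r) - 1.0)


def binary_tree_nodes(depth: int) -> int:
    """Number of nodes in a complete binary tree of the given depth."""
    return 2 ** (depth + 1) - 1


if __name__ == "__main__":
    for r in (1, 2, 4, 8):
        print(r, round(euclidean_disk_area(r), 1),
              round(hyperbolic_disk_area(r), 1), binary_tree_nodes(r))
```

Each unit of extra radius multiplies the hyperbolic area by roughly e, in the same way each extra level roughly doubles the node count of a binary tree.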
Image credit Krioukov et al.
Hyperbolic Embedding and Greedy Routing
• Hyperbolic space gives us “extra
room” to embed networks (as
opposed to Euclidean space).
• A number of algorithms take
advantage of this to devise
greedy routing schemes
• Kleinberg uses a minimum
spanning tree, embedded as a
subset of a d-regular tree, where
d is the maximum degree of the
MST (d = 4 is shown at right)
Image credit Kleinberg
So is it good or bad?
Image credit M.C.Escher
A generative model
• Three-parameter model introduced by Krioukov et al. uses an underlying
hyperbolic geometry and allows us to vary the curvature, degree
heterogeneity, and density. (Physicists: this is basically fermions)
• Idea: place nodes in the hyperbolic plane (Poincare
disk) and connect them with a probability that
depends on their hyperbolic distance.
• Knob 1: Power law exponent: determines
distribution of nodes in the disk – the higher the
exponent, the more nodes go towards the center.
This determines the curvature (and degree
heterogeneity)
• Knob 2: Temperature: determines how much we
ignore the underlying geometry when adding edges;
at high temperatures, edge connections become
essentially random (independent of distance).
• Knob 3: Average degree (target): allows
approximate control over density
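A minimal sketch of a generator in the spirit of this model. It is not the code provided by D. Krioukov, and several details are assumptions: curvature is fixed at -1, the power-law exponent gamma is mapped to a radial density parameter alpha = (gamma - 1)/2, the connection probability is the Fermi-Dirac form 1/(1 + exp((x - R)/(2T))), and the disk radius R is passed in directly instead of being solved for from a target average degree. All names (sample_graph, disk_radius, etc.) are hypothetical.

```python
import math
import random


def hyperbolic_distance(r1, t1, r2, t2):
    """Distance between polar points (r1, t1) and (r2, t2), curvature -1."""
    dtheta = math.pi - abs(math.pi - abs(t1 - t2))  # angle folded into [0, pi]
    arg = (math.cosh(r1) * math.cosh(r2)
           - math.sinh(r1) * math.sinh(r2) * math.cos(dtheta))
    return math.acosh(max(1.0, arg))  # clamp guards against rounding below 1


def sample_graph(n, gamma, temperature, disk_radius, seed=0):
    """Sample node coordinates and edges (assumed model form, see lead-in)."""
    rng = random.Random(seed)
    alpha = (gamma - 1.0) / 2.0  # assumption: gamma = 2 * alpha + 1
    nodes = []
    for _ in range(n):
        # Inverse-CDF sample of a radial density proportional to sinh(alpha * r).
        u = rng.random()
        r = math.acosh(1.0 + u * (math.cosh(alpha * disk_radius) - 1.0)) / alpha
        nodes.append((r, rng.uniform(0.0, 2.0 * math.pi)))
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            x = hyperbolic_distance(*nodes[i], *nodes[j])
            z = (x - disk_radius) / (2.0 * temperature)
            p = 1.0 / (1.0 + math.exp(min(z, 700.0)))  # cap avoids overflow
            if rng.random() < p:
                edges.append((i, j))
    return nodes, edges


if __name__ == "__main__":
    for temp in (0.5, 1.5, 20.0):
        _, e = sample_graph(200, 2.1, temp, disk_radius=8.0, seed=1)
        print("temperature", temp, "edges", len(e))
```

At high temperature the probability flattens toward 1/2 for every pair, which matches the slide's remark that edges become essentially independent of distance.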
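[Figure: the model's parameter space, with temperature (Temp.) and curvature (Curv.) axes each running from finite to infinite. Labeled regimes: random hyperbolic graphs, classical random graphs (Erdos-Renyi), random geometric graphs, and random graphs w/ given expected degrees.]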
Our test parameters
Power Law:    2.1, 2.25, 2.5
Temperature:  20, 1.5, 0.5
Avg. Degree:  10, 20, 5
Special Thanks
Image credit San Diego Reader
Special thanks to D. Krioukov for providing us code to generate networks
according to the model described on the previous slide.
Hyperbolic Embedding for Inference
• Boguna, Krioukov, Papadopoulos have mapped “the internet” to hyperbolic
space, and used the embedding to identify community structure (and to
suggest routing schemes).
• Their approach relies on iterative maximum-likelihood estimation (MLE)
and does not appear to scale to “big data”.
Image credit Boguna, Krioukov, Papadopoulos
A geometric measure of tree-likeness
• Gromov’s δ-hyperbolicity arises from the geometry of metric spaces and δ
measures the extent to which a (geodesic) metric space embeds in a tree metric.
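• Four-point condition: for vertices u, v, w, x, form the three pairwise distance sums below; δ for the quadruple is half the difference between the two largest sums.
Example 1 (all six pairwise distances equal 1): δ = 0
  d(u,v) + d(w,x) = 1 + 1 = 2
  d(u,x) + d(v,w) = 1 + 1 = 2
  d(u,w) + d(v,x) = 1 + 1 = 2
Example 2 (4-cycle u, v, x, w): δ = 1
  d(u,v) + d(w,x) = 1 + 1 = 2
  d(u,x) + d(v,w) = 2 + 2 = 4
  d(u,w) + d(v,x) = 1 + 1 = 2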
• Note: d(u,v) is the length of the shortest path between u and v in the graph.
• The minimum δ for which G is δ-hyperbolic can be computed (naively) in O(n^4)
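The naive O(n^4) computation can be sketched as follows. This assumes a connected, unweighted, undirected graph given as an adjacency dictionary, and uses the convention consistent with the examples on this slide (half the difference of the two largest distance sums). Function names are illustrative.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest-path lengths from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def four_point_delta(d, u, v, w, x):
    """Delta of one quadruple: half the gap between the two largest sums."""
    sums = sorted((d[u][v] + d[w][x], d[u][x] + d[v][w], d[u][w] + d[v][x]))
    return (sums[2] - sums[1]) / 2.0


def hyperbolicity(adj):
    """Maximum four-point delta over all quadruples (naive, O(n^4))."""
    d = {s: bfs_distances(adj, s) for s in adj}
    return max((four_point_delta(d, *q) for q in combinations(adj, 4)),
               default=0.0)


if __name__ == "__main__":
    four_cycle = {"u": ["v", "w"], "v": ["u", "x"],
                  "x": ["v", "w"], "w": ["x", "u"]}
    print(hyperbolicity(four_cycle))
```

The exhaustive loop over quadruples is what makes exact computation impractical beyond a few thousand nodes.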
More on δ-hyperbolicity
• Viewing graphs as a geodesic metric space (replace edges with length 1 segments
intersecting only at endpoints) provides another way to think of δ-hyperbolicity.
• For a geodesic triangle, there is a unique isometry to a tripod so that, except for the
leaves, each point on the tripod has two pre-images on the triangle.
Image credit: Chepoi, Dragan et al
Image credit: Bridson, Haefliger
• A triangle is δ-thin if the pre-images of every tripod point have distance at most δ.
• A triangle is δ-slim if each of its sides is contained in the δ-neighborhood of the union
of the other two sides.
• A graph is δ-hyperbolic if all its geodesic triangles are δ-thin (or δ-slim); each definition
gives a slightly different minimum δ, and the values are related by small constant factors.
Examples: Small world graphs & Ringed Trees
• Kleinberg’s small-world random graphs add
long-range edges to a d-dimensional grid with
probability proportional to 1/d_B(u,v)^p.
• Mahoney et al (2011) showed even at the
“sweet spot” of p = d, the small-world graphs
are not logarithmically hyperbolic w.h.p. When
p < d, the graphs are not hyperbolic, and for p >
3 and d = 1, the hyperbolic delta is polynomial in
the size of the graph.
• Define a ringed tree to be a binary tree
plus edges connecting all vertices at a given
tree level into a ring (quasi-isometric to the
Poincare disk)
• Adding long-range edges between the
leaves of a ringed tree w/ probability
decreasing:
– exponentially fast with the ring distance
produces logarithmic hyperbolicity
– as a power-law with the ring distance produces
non-hyperbolic random graphs
• Replace the ringed tree with a pure binary tree:
none of the resulting graphs are hyperbolic.
Image credit: Mahoney et al
Empirical Results: Real Graphs
Empirical Results: “Planar”
• Planar graphs have a very different
distribution of delta over their
quadruples, and very high
diameters.
Empirical Results: “Hyperbolic”?
• Much more subtle differences
when looking at non-planar graphs.
• Density seems to play a role, and
most networks considered had very
low diameter.
Computing δ: Sampling
• Due to high computational complexity, a number of prior works have used
sampling to estimate the hyperbolicity of large networks.
• Some prior work sampled at a rate of about 0.0002 percent (on their largest data).
Although that sampling was biased towards pairs at larger distances, it could still easily
miss the maximum delta, which is achieved on a very small subset of quadruplets
(a fraction of about 2 x 10^-11 in our example). Sampling is, however, likely to be
sufficient for computing average deltas.
• Example below is SNAP graph as20000101 (about 1600 nodes)
delta    Fraction of quadruplets    # of quadruplets
0.0      0.677473774788751          4577453756970
0.5      0.313235924997126          2116425779202
1.0      0.009262044976055          62580404070
1.5      0.000028008357243          189242691
2.0      0.000000246259522          1663890
2.5      0.000000000022835          154
Total    0.999999999401533          6756650846976
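A sketch of the sampling approach, for comparison with the exhaustive counts above. It draws quadruples uniformly at random (not biased towards distant pairs, unlike some of the prior work mentioned) and assumes a connected, unweighted graph small enough to store all-pairs distances. The names are illustrative.

```python
import random
from collections import Counter, deque


def all_pairs_distances(adj):
    """BFS from every vertex; fine for small graphs only."""
    dist = {}
    for s in adj:
        d = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[s] = d
    return dist


def sample_delta_histogram(adj, num_samples, seed=0):
    """Histogram of four-point deltas over uniformly sampled quadruples."""
    rng = random.Random(seed)
    nodes = list(adj)
    d = all_pairs_distances(adj)
    hist = Counter()
    for _ in range(num_samples):
        u, v, w, x = rng.sample(nodes, 4)  # four distinct vertices
        sums = sorted((d[u][v] + d[w][x], d[u][x] + d[v][w],
                       d[u][w] + d[v][x]))
        hist[(sums[2] - sums[1]) / 2.0] += 1
    return hist


def mean_delta(hist):
    """Average delta over the sampled quadruples."""
    total = sum(hist.values())
    return sum(delta * count for delta, count in hist.items()) / total


if __name__ == "__main__":
    cycle = {i: [(i - 1) % 12, (i + 1) % 12] for i in range(12)}
    hist = sample_delta_histogram(cycle, 1000, seed=1)
    print(sorted(hist.items()), "mean:", mean_delta(hist))
```

The sampled maximum is only a lower bound on the true δ, while the sampled mean is an ordinary Monte Carlo estimate of the average, which is why sampling suits averages better than maxima.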
K-core Decompositions
• Given a graph G = (V,E), the k-core of G, denoted H_k, is the
maximal subgraph H of G such that deg_H(v) is at least k for all v in H.
• The core number of a
vertex v is defined to be
the maximum k such that
v is in H_k (so v is not in H_{k+1}).
• The set of nodes with
core number k is called
the k-shell of G.
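A short sketch of the standard peeling procedure for core numbers: repeatedly remove a minimum-degree vertex and record the largest minimum degree seen so far. It assumes a simple undirected graph stored as a symmetric adjacency dictionary. This version is quadratic for clarity; bucket-based variants run in linear time.

```python
def core_numbers(adj):
    """Core number of every vertex via minimum-degree peeling."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])      # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1     # v no longer counts towards w's degree
    return core


def k_shell(core, k):
    """Vertices whose core number is exactly k."""
    return {v for v, c in core.items() if c == k}


if __name__ == "__main__":
    # A triangle (0, 1, 2) with a pendant vertex 3 attached to 0.
    graph = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0]}
    core = core_numbers(graph)
    print(core, k_shell(core, 2))
```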
Condensed Matter Collaboration Network
Image credit: LaNet-vi
Empirical Results: Social Graphs
Facebook-Texas84
~36,000 nodes
~3x10^6 edges
soc-Epinions1
~47,000 nodes
~730,000 edges
Empirical Results: Autonomous Systems
AS19990820
~5,500 nodes
~22,000 edges
AS19990818
~5,500 nodes
~22,000 edges
Empirical Results: Collaboration Graphs
CA-AstroPhysics
~18,000 nodes
~394,000 edges
CA-GrQc
~4,000 nodes
~26,000 edges
Empirical results: Synthetic by power law exponent
Empirical results: Synthetic by temperature
Some (oversimplified) Summary Statistics
ca-AstroPhysics:
• ~0.6% of nodes (113 nodes) in two deepest cores (k = 55,56)
• ~1.8% of edges (~7,000 edges) leaving the deepest core (k = 56)
• ~1.8% of edges (~7,000 edges) leaving next core (k = 55)
• Max average k-shell change is +12 (out of k = 56 max shell)
• Suggests collaborators tend to collaborate with people of similar coreness/peripheryness
• “Typical” for collaboration graphs (and other core-periphery graphs)
Texas84:
• ~8% of nodes (≥2400 nodes) in two deepest cores (k = 80,81)
• ~7% of edges (≥220K edges) leaving the deepest core (k = 81)
• ~17% of edges (≥510K edges) leaving the next core (k = 80)
• Max average k-shell change is +50 (out of k = 80 max shell)
• Suggests that the “periphery” nodes are more tightly connected to “core-like” nodes
• “Typical” for more social graphs (and Facebook in particular)
A combinatorial measure of tree-likeness
• A tree decomposition of a graph G = (V,E) is a pair (X = {X1, X2, ..., XL}, T) with
each Xi a subset of V, and T a tree with nodes {1, …, L}, satisfying three conditions:
– The union of the sets in X is equal to V
– For every edge (u,v) in G, {u,v} is a subset of some Xi
– For every v in V, the indices of the sets Xi containing v form a sub-tree of T.
• We call the sets Xi the bags of the decomposition and max(|Xi|) - 1 the width.
The tree-width of G is the minimum width over all valid tree decompositions.
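The three conditions can be checked mechanically. The sketch below assumes bags are given as a dictionary from tree-node index to a set of graph vertices, and that T really is a tree (this is not verified). The width function uses the usual convention of largest bag size minus one, so that trees have width 1. All names are illustrative.

```python
def is_tree_decomposition(vertices, graph_edges, bags, tree_edges):
    """Check the three tree-decomposition conditions (T assumed to be a tree)."""
    # Condition 1: the bags cover every vertex and nothing else.
    if set().union(*bags.values()) != set(vertices):
        return False
    # Condition 2: every graph edge lies inside some bag.
    for u, v in graph_edges:
        if not any(u in bag and v in bag for bag in bags.values()):
            return False
    # Condition 3: bags containing a vertex induce a connected piece of T.
    tree_adj = {i: set() for i in bags}
    for i, j in tree_edges:
        tree_adj[i].add(j)
        tree_adj[j].add(i)
    for v in vertices:
        holding = {i for i, bag in bags.items() if v in bag}
        start = next(iter(holding))
        seen, stack = {start}, [start]
        while stack:
            i = stack.pop()
            for j in tree_adj[i]:
                if j in holding and j not in seen:
                    seen.add(j)
                    stack.append(j)
        if seen != holding:
            return False
    return True


def width(bags):
    """Width of a decomposition: largest bag size minus one."""
    return max(len(bag) for bag in bags.values()) - 1


if __name__ == "__main__":
    # Path a - b - c - d, decomposed into one bag per edge.
    vertices = ["a", "b", "c", "d"]
    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    bags = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
    print(is_tree_decomposition(vertices, edges, bags, [(1, 2), (2, 3)]),
          width(bags))
```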
Understanding FPT: “problems are easier on trees”
• Many NP-hard problems can be solved in polynomial time on trees (graphs with no
cycles)
Example: Maximum Weighted Independent Set: Complexity O(|V|)
[Figure: a vertex-weighted tree annotated with dynamic-programming value pairs: (17,15), (8,10), (7,5), (3,6), (4,1), (3,0), (2,0), (1,0).]
• We can generalize this dynamic programming approach to get polynomial
algorithms (in graph size) on graphs where tree-width is bounded.
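A sketch of the tree dynamic program for Maximum Weighted Independent Set. Each vertex gets a pair (best weight if the vertex is included, best weight if it is excluded), computed from its children. The example tree below is made up for illustration and is not the one in the slide's figure.

```python
def max_weight_independent_set(children, weight, root):
    """Return (optimum, include, exclude) for a rooted tree in O(|V|) time."""
    # Pre-order traversal; reversing it visits children before parents.
    order, stack = [], [root]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    include, exclude = {}, {}
    for v in reversed(order):
        kids = children.get(v, [])
        # Taking v forbids its children.
        include[v] = weight[v] + sum(exclude[c] for c in kids)
        # Skipping v lets each child choose its better option.
        exclude[v] = sum(max(include[c], exclude[c]) for c in kids)
    return max(include[root], exclude[root]), include, exclude


if __name__ == "__main__":
    # Hypothetical example tree (vertex names and weights are illustrative).
    children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}
    weight = {"r": 2, "a": 1, "b": 4, "a1": 3, "a2": 2, "b1": 1}
    best, inc, exc = max_weight_independent_set(children, weight, "r")
    print(best, {v: (inc[v], exc[v]) for v in weight})
```

On bounded tree-width graphs the same idea applies per bag, with one table entry for each independent subset of the bag rather than the two entries used here.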
Heuristics for low-width decompositions
• In numerical linear algebra, one often wants to permute the rows of a matrix before
computing a factorization so that the resulting factors are as sparse as possible.
The objective is to minimize the number of “fill edges” added.
• For tree decompositions, we
instead need to minimize the
maximum clique size in the
resulting chordal graph.
• Numerous implementations of
common heuristics are
available, and we tested
several on a large set of
random graphs with a fixed
maximum width and varying
sizes.
Comparison of width and fill from 6 heuristics on
graphs known to have tw <= 30
• Min-degree-based heuristics
are orders of magnitude faster
than min-fill, etc.
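A sketch of the plain minimum-degree elimination heuristic, which yields an upper bound on tree-width. This is the exact, naive version with ties broken by dictionary order; it is not the approximate minimum degree (AMD) implementation used for the results in this talk. Names are illustrative.

```python
def min_degree_elimination(adj):
    """Greedy min-degree elimination ordering.

    Returns (order, width, fill): the elimination order, the largest degree
    at elimination time (an upper bound on tree-width), and the number of
    fill edges added to make each eliminated neighborhood a clique.
    """
    g = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    order, width, fill = [], 0, 0
    while g:
        v = min(g, key=lambda u: len(g[u]))
        nbrs = list(g.pop(v))
        width = max(width, len(nbrs))
        for a in nbrs:
            g[a].discard(v)
        # Turn the neighborhood of v into a clique, counting new edges.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in g[a]:
                    g[a].add(b)
                    g[b].add(a)
                    fill += 1
        order.append(v)
    return order, width, fill


if __name__ == "__main__":
    cycle5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
    print(min_degree_elimination(cycle5))
```

A min-fill variant would pick the vertex whose elimination adds the fewest fill edges, which requires inspecting each candidate's neighborhood and helps explain the speed gap noted on the slide.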
Empirical results: Synthetic
AMD Upper Bounds:
MCS Lower Bounds:
More…
AMD Upper Bounds:
MCS Lower Bounds:
Empirical Results: Facebook
Empirical Results: Autonomous Systems
• A larger AS graph had similar results: 600K nodes resulted in a 200K largest
connected component, and the upper bound was 5961, lower bound 32.
Problems with Using Tree Decompositions
• Every bag in a tree decomposition is a vertex separator, so a low-width
decomposition means many small separators.
• Treewidth is linear in n (Θ(n)) w/ high probability for many random graphs (Gao 2009):
– Erdos-Renyi graphs G(n,m) when m/n > 1.073
– Random intersection graphs G(n,m,p) on universe {1,…,m} with m = n^a, p at least
2/m and a > 0.
– Barabasi-Albert preferential attachment with at least 12 new edges for each
additional vertex.
• Current heuristics get lost in “local noise”
Average k-cores on a tree decomposition
Temperature: 20
Power law exp: 2.1
Avg deg target: 5
Average k-cores on a tree decomposition
Temperature: 0.5
Power law exp: 2.1
Avg deg target: 20
Real Graphs
What’s next?
• Clustering
• Diffusions
• Sparse Dimensionality Reduction
• Applications to Statistical Inference
Acknowledgements
Primary support for this work was provided through the ORNL Laboratory Directed Research &
Development SEED Program.
These slides would not have been possible without many hours of hard work by Aaron Adcock.
Backup Slides
Motivation for some improvements to min-degree
[Figure: the vertex chosen for elimination under Minimum Fill-In versus under Minimum Degree on an example graph.]
Tiebreaking with second neighbors
Joint work with Gloria D’Azevedo (ORHS student) and Chris Groer (ORNL).
• Gloria investigated various strategies for
breaking ties within min-degree and min-fill
algorithms
• Her hypothesis was that including information
about second-neighborhoods could improve
the quality of these heuristics
• Even with optimizations, the running time of
the improved algorithms was often
significantly slower than random tie-breaks
due to computation of additional information
(fill or second-neighborhood sizes)
An example where second neighbors help
MIND
MIND+(0.5)(SEC)