Network Entropy
Edwin Hancock
Department of Computer Science
University of York
Supported by a Royal Society Wolfson Research Merit Award
My History
• 1977-85: High energy physics (CERN: hyperon resonances, QCD quark model; SLAC: charm particle lifetimes).
• 1985- : Pattern recognition and computer vision.
• 1991- : Algorithms for graph matching, clustering and embedding.
• 2010- : Complex networks.
• My academic tree: Cecil Powell (2), Ernest Rutherford (3), J.J. Thomson (4), Darwin (7), Newton (16) and William Golding (1).
Complex networks meet machine learning.
Outline
• Graphs and their spectra
• Von Neumann entropy
• Entropy component analysis
• Graph kernels
• Network inference
• Spin statistics
• Thermodynamic depth
Causal networks
• Nodes: objects, agents, financial entities, brain regions.
• Edges: indicate causal relations or interactions between objects (nodes). Can be directed or undirected.
• Characterised using correlation measures, e.g. Granger causality or transfer entropy.
Network structure and function
• What can we learn about the function performed by a network from its structure?
• How do changes in structure reflect changes in function?
• Can we find network characterisations that allow us to analyse these relationships?
Network Complexity
• Randomness complexity: probabilistically compute the degree of disorganisation or randomness of a network using Shannon entropy.
• Statistical complexity: characterise combinatorial structure using statistical features such as node degree statistics or the Laplacian spectrum.
Randomness complexity
• Körner's graph entropy: minimises the cross entropy between the graph and its vertex packing polytope. Relies on computing chromatic numbers.
• Compute degree inhomogeneity using Shannon entropy.
• Weakness: does not model vertex correlations.
Statistical Complexity
• Measure regularity: beyond randomness.
• Logical depth: variant of Kolmogorov-Chaitin complexity.
• Transformation cost: number of topological operations or matrix calculations needed to transform the network to a canonical form.
Thermodynamic Depth
• Based on causality of heat flow rather than computational steps.
Protein-Protein Interaction Networks
Structural Variations
New York Stock Exchange
• Closing prices of several hundred stocks over a 6000-day trading period. Represent as a network time series.
• On each day, measure the correlation of the time series for each pair of stocks over a 30-day window.
• Create a link if the correlation exceeds a threshold.
• Gives a time series of graphs, each one representing the state of trading on a particular day.
• Can we determine the function of the network and the effects of the world economy by analysing the structure of the network?
Financial Network: Correlation of closing price time series for pairs of stocks.
Analyse modular structure
• Path-length and cycle structure.
• Notions of community, communicability and centrality.
Big Data: NYSE
Entropy structure of time series
• New York Stock Exchange over 6000 days for 320 companies trading over the entire period.
• Plot change in entropy vs time.
Details
Complexity level analysis
• Detecting modular structure can prove computationally cumbersome (must detect structures such as hubs, communities etc.).
• Does entropy serve as a convenient and simple measure of how the structure of a network changes with time?
• Will it allow us to detect and understand events that cause sudden changes?
• What does this reveal about network function?
Entropy
• Measure of unpredictability of information content.
• Shannon H(X)=E(I(X)) – expected value of information content of a random variable.
• Informativeness $I(X) = \ln[1/P(X)]$, and so $H(X) = -E(\ln P(X))$.
• If $X = \{x_1, \dots, x_n\}$ is the set of outcomes, then $H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$.
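A minimal numerical sketch of the definition above (assuming Python with NumPy; the function name is illustrative):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum_i p(x_i) ln p(x_i), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 ln 0 = 0
    return float(-np.sum(p * np.log(p)))

print(shannon_entropy([0.5, 0.5]))      # fair coin: ln 2 ~ 0.693 nats
print(shannon_entropy([1.0]))           # sure event: no information, 0.0
```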
Entropy and Information
• $I(p) \ge 0$: information is non-negative; $I(1) = 0$: sure events convey no information; $I(p_1 p_2) = I(p_1) + I(p_2)$: information is additive for independent events.
• $I(p) = \log(1/p)$
Entropy and thermodynamics
• Statistical mechanics gives the Gibbs entropy as $S = -k_B \sum_i p_i \ln p_i$, where $k_B$ is the Boltzmann constant and $p_i$ is the microstate probability.
Entropy in Quantum Mechanics
• Von Neumann entropy: $S = -k_B \,\mathrm{Tr}[\rho \ln \rho]$, where $\rho$ is the density matrix of the system.
Link between TD and IT
• The entropy of a macrostate is given by $S = k_B \ln W$, where $W$ is the number of microstates of the system.
Graph spectra
Laplacian Matrix
• Weighted adjacency matrix
$$W(u,v) = \begin{cases} w(u,v) & (u,v) \in E \\ 0 & \text{otherwise} \end{cases}$$
• Degree matrix
$$D(u,u) = \sum_{v \in V} W(u,v)$$
• Laplacian matrix
$$L = D - W$$
Laplacian spectrum
• Spectral decomposition of the Laplacian
$$L = \Phi \Lambda \Phi^T = \sum_k \lambda_k \phi_k \phi_k^T, \qquad 0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_{|V|}$$
$$\Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_{|V|}), \qquad \Phi = (\phi_1 \,|\, \dots \,|\, \phi_{|V|})$$
• Element-wise
$$L(u,v) = \sum_k \lambda_k \phi_k(u) \phi_k(v)$$
Properties of the Laplacian
• Eigenvalues are non-negative and the smallest eigenvalue is zero: $0 = \lambda_1 \le \lambda_2 \le \dots \le \lambda_{|V|}$.
• The multiplicity of the zero eigenvalue is the number of connected components of the graph.
• The zero eigenvalue is associated with the all-ones vector.
• The eigenvector associated with the second smallest eigenvalue is the Fiedler vector.
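A short sketch of these spectral quantities, assuming Python with NumPy and NetworkX (the graph and tolerance are illustrative):

```python
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(20, 0.2, seed=0)
L = nx.laplacian_matrix(G).toarray().astype(float)    # L = D - W

lam, Phi = np.linalg.eigh(L)                           # L = Phi Lambda Phi^T, ascending eigenvalues
print("smallest eigenvalues:", np.round(lam[:3], 4))

# multiplicity of the zero eigenvalue = number of connected components
print(int(np.sum(np.isclose(lam, 0.0, atol=1e-9))),
      nx.number_connected_components(G))

fiedler = Phi[:, 1]                                    # eigenvector of the second smallest eigenvalue
```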
Von Neumann entropy and node degree
Von Neumann Entropy
• Passerini and Severini – the normalised Laplacian, scaled by $|V|$, is the density matrix for the graph Hamiltonian: $\rho = \hat{L}/|V|$, where
$$\hat{L} = D^{-1/2} (D - A) D^{-1/2} = \hat{\Phi} \hat{\Lambda} \hat{\Phi}^T$$
• The von Neumann entropy is $H = -\mathrm{Tr}[\rho \ln \rho]$:
$$H_{VN} = -\sum_{i=1}^{|V|} \frac{\hat{\lambda}_i}{|V|} \ln \frac{\hat{\lambda}_i}{|V|}$$
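A direct implementation of this definition, assuming Python with NumPy and NetworkX (a sketch, not the authors' code):

```python
import networkx as nx
import numpy as np

def von_neumann_entropy(G):
    """H_VN = -sum_i (lam_i/|V|) ln(lam_i/|V|), lam_i = normalised-Laplacian eigenvalues."""
    L_hat = nx.normalized_laplacian_matrix(G).toarray()
    lam = np.linalg.eigvalsh(L_hat)
    p = lam / G.number_of_nodes()           # eigenvalues of the density matrix rho = L_hat/|V|
    p = p[p > 1e-12]                        # drop numerically zero eigenvalues (0 ln 0 = 0)
    return float(-np.sum(p * np.log(p)))

print(von_neumann_entropy(nx.karate_club_graph()))
```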
Approximation
• Quadratic entropy
$$H_{VN} \approx \sum_{i=1}^{|V|} \frac{\hat{\lambda}_i}{|V|} \left\{ 1 - \frac{\hat{\lambda}_i}{|V|} \right\} = \frac{1}{|V|} \sum_{i=1}^{|V|} \hat{\lambda}_i - \frac{1}{|V|^2} \sum_{i=1}^{|V|} \hat{\lambda}_i^2$$
• In terms of matrix traces
$$H_{VN} = \frac{\mathrm{Tr}[\hat{L}]}{|V|} - \frac{\mathrm{Tr}[\hat{L}^2]}{|V|^2}$$
Computing Traces
• Normalised Laplacian: $\mathrm{Tr}[\hat{L}] = |V|$
• Normalised Laplacian squared: $\mathrm{Tr}[\hat{L}^2] = |V| + \sum_{(u,v) \in E} \dfrac{1}{d_u d_v}$
Simplified quadratic entropy
With the quadratic approximation $p \ln p \approx -p(1-p)$, the von Neumann entropy reduces to
$$H_{VN} = 1 - \frac{1}{|V|} - \frac{1}{|V|^2} \sum_{(u,v) \in E} \frac{1}{d_u d_v}$$
(Each edge contributes a term depending only on the degrees $d_u$ and $d_v$ of the vertices it connects.)
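The degree-based approximation is cheap to evaluate; a sketch in Python/NetworkX (illustrative only):

```python
import networkx as nx

def approx_von_neumann_entropy(G):
    """H ~ 1 - 1/|V| - (1/|V|^2) * sum over edges of 1/(d_u d_v)."""
    n = G.number_of_nodes()
    d = dict(G.degree())
    s = sum(1.0 / (d[u] * d[v]) for u, v in G.edges())
    return 1.0 - 1.0 / n - s / n**2

print(approx_von_neumann_entropy(nx.karate_club_graph()))
```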
Homogeneity index (Estrada)
Based on degree statistics:
$$\rho(G) = \sum_{(u,v) \in E} \left( d_u^{-1/2} - d_v^{-1/2} \right)^2$$
and, in normalised form,
$$\rho(G) = \frac{1}{|V| - 2\sqrt{|V| - 1}} \sum_{(u,v) \in E} \left\{ \frac{1}{d_u} + \frac{1}{d_v} - \frac{2}{\sqrt{d_u d_v}} \right\}$$
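A sketch of the normalised index in Python/NetworkX (function name is illustrative):

```python
import networkx as nx

def degree_heterogeneity(G):
    """rho(G) = sum_{(u,v) in E} (d_u^{-1/2} - d_v^{-1/2})^2, normalised by |V| - 2*sqrt(|V|-1)."""
    d = dict(G.degree())
    rho = sum((d[u] ** -0.5 - d[v] ** -0.5) ** 2 for u, v in G.edges())
    n = G.number_of_nodes()
    return rho / (n - 2 * (n - 1) ** 0.5)

print(degree_heterogeneity(nx.star_graph(10)))     # star graph: extremal value 1
print(degree_heterogeneity(nx.cycle_graph(10)))    # regular cycle: 0
```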
Homogeneity meaning
In the limit of large degree,
$$\rho(G) \sim \sum_{(u,v) \in E} \{ CT(u,v) - 2A(u,v) \}$$
where $CT(u,v)$ is the commute time. The index is largest when the commute time differs from 2 due to a large number of alternative connecting paths.
Properties
Based on degree statistics
Extremal values for cycle and star graphs.
Can be used to distinguish Erdős–Rényi, small-world, and scale-free networks.
Extend to directed graphs
• Described in Cheng et al., Phys. Rev. E 2014.
• Commence from Chung's spectral specification of the directed graph Laplacian.
• Repeat analysis steps to extend quadratic approximation of VN entropy to directed graphs.
• Find specific approximations for strongly and weakly directed graphs.
Directed Laplacian
Transition matrix, left eigenvector components, normalised Laplacian.
Trace calculations
Directed graphs
• Directed edge: $(d_u^{in}, d_u^{out}) \longrightarrow (d_v^{in}, d_v^{out})$
• Partition the edge set into unidirectional edges (one way only) $E_1$ and bidirectional edges (both ways) $E_2$.
• The edge set is the union $E = E_1 \cup E_2$.
Directed Graphs
The von Neumann entropy depends on the in-degree and out-degree of the vertices connected by edges:
$$H = 1 - \frac{1}{|V|} - \frac{1}{2|V|^2} \left\{ \sum_{(u,v) \in E} \frac{d_u^{in}}{d_v^{in} (d_u^{out})^2} - \sum_{(u,v) \in E_2} \frac{1}{d_u^{out} d_v^{out}} \right\}$$
The development comes from the Laplacian of a directed graph (Chung). Equivalently,
$$H = 1 - \frac{1}{|V|} - \frac{1}{2|V|^2} \left\{ \sum_{(u,v) \in E} \left( \frac{d_u^{in}}{d_u^{out}} \right) \frac{1}{d_u^{out} d_v^{in}} - \sum_{(u,v) \in E_2} \frac{1}{d_u^{out} d_v^{out}} \right\}$$
Strongly Directed Graphs
Most edges are unidirectional, with few bidirectional edges ($|E_1| \gg |E_2|$):
$$H = 1 - \frac{1}{|V|} - \frac{1}{2|V|^2} \sum_{(u,v) \in E} \frac{1}{d_u^{out} d_v^{in}}$$
Weakly Directed Graphs
Most edges are bidirectional, with few unidirectional edges ($|E_1| \ll |E_2|$):
$$H = 1 - \frac{1}{|V|} - \frac{1}{2|V|^2} \sum_{(u,v) \in E} \frac{d_u^{in}/d_u^{out} + d_v^{in}/d_v^{out}}{d_u^{out} d_v^{in}}$$
Development comes from Laplacian of a directed graph (Chung).
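A sketch of the full directed approximation in Python/NetworkX. How the bidirectional edge set $E_2$ is counted (once per reciprocal pair, or once per ordered arc) is an assumption here:

```python
import networkx as nx

def directed_vn_entropy(G):
    """H ~ 1 - 1/|V| - 1/(2|V|^2) * [ sum_E d_u^in/(d_v^in (d_u^out)^2)
                                       - sum_E2 1/(d_u^out d_v^out) ]."""
    n = G.number_of_nodes()
    din, dout = dict(G.in_degree()), dict(G.out_degree())
    s1 = sum(din[u] / (din[v] * dout[u] ** 2) for u, v in G.edges())
    # bidirectional term: here each ordered arc of a reciprocal pair contributes once
    s2 = sum(1.0 / (dout[u] * dout[v]) for u, v in G.edges() if G.has_edge(v, u))
    return 1.0 - 1.0 / n - (s1 - s2) / (2 * n ** 2)

print(directed_vn_entropy(nx.gnp_random_graph(50, 0.1, directed=True, seed=1)))
```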
Links to assortativity
• Weakly directed: proportional to the in/out degree ratio of the nodes; inversely proportional to the product of the out-degree of the start node and the in-degree of the end node (degree flow).
• Strongly directed: inversely proportional to the product of the out-degree of the start node and the in-degree of the end node.
Financial Market Data
• Look at time series correlations for a set of leading stocks.
• Create undirected or directed links on the basis of time series correlation.
• Directed von Neumann entropy as the stock market network evolves.
• Troughs represent financial crises; the deepest one corresponds to Black Monday, 1987.
(Plot annotations: 1997 Asian financial crisis, Friday the 13th mini-crash, Black Monday, Bankruptcy of Lehman Brothers.)
• Directed von Neumann entropy change during Black Monday, 1987.
• Entropy witnesses a sharp drop on Black Monday and recovers within a few trading days.
Black Monday
What happened
• Entropy drops during a crisis and remains lower than the pre-crisis level (via a metastable intermediate state).
• Phase transition.
• Low degree connections replaced by high degree connections.
• Network path lengths reduced.
• Histograms of added edges in degree space during the financial crisis (Black Monday): most added edges have a higher probability of appearing between two nodes with low degrees.
• Histograms of removed edges in degree space during the financial crisis (Black Monday): edges that connect two nodes with low degrees have a higher probability of disappearing.
• Histograms of unchanged edges in degree space during the financial crisis (Black Monday): edges that connect two nodes with higher degrees are more likely to remain unchanged.
…but
• Different crises have a different structure.
• Entropy (plus additional thermodynamic concepts such as temperature and volume) allow these to be analysed and classified.
• Taxonomy of phase transitions.
Simple thermodynamic model
• Average energy is related to the number of edges: $U = \mathrm{Tr}[D^{1/2} L D^{-1/2}]$.
• Energy change: $dU = T\,dS - P\,dV$.
• Constant volume (fixed number of nodes):
$$\frac{1}{T} = \frac{\Delta S}{\Delta U} = \sum_{(u,v) \in E} \frac{d_u \Delta_v + d_v \Delta_u + \Delta_u \Delta_v}{|\Delta E|\, d_u (d_u + \Delta_u)(d_v + \Delta_v)}$$
• Measures node degree change correlations.
Entropy Component Analysis
Multivariate histograms of entropy with degree
Idea
• Undirected graph – index the entropy of edge $(u,v)$ by the Cartesian pair $(d_u, d_v)$.
• Strongly directed graph – index the entropy of directed edge $(u,v)$ by the triple $(d_u^{in}, d_u^{out}, d_v^{in})$.
• Use index histogram of edge entropy.
• Vectorise histograms.
• Perform PCA over a sample of vectors for different graphs.
Entropy Increments
• When the cardinality of the bidirectional edge set is very small, i.e. the graph is strongly directed (SD), the entropy formula can be simplified a step further.
• Normalised local entropic measure for each unidirectional edge in the graph.
• For bidirectional edges, we add an additional contribution to the above measure.
Feature Vector from Entropy Distribution
Graph characterisation: based on the statistical information conveyed by the edge entropy distribution.
Representation: 4D histogram over the in- and out-degrees of the two vertices connected by an edge.
Potential problem: bin contents can become sparse in a high-dimensional histogram. Compute the cumulative distribution function (CDF) over predefined quantiles.
Let … be the in-degree probability distribution of the graph; the corresponding CDF is … .
Quantising the multivariate entropy distribution
The m-quantiles of the in-degree distribution are … .
Assign each vertex degree quantile labels ranging from 1 to m, allowing us to construct a 4D histogram whose size in each dimension is fixed to m.
Storing information: an m × m × m × m array (histogram) M. Elements represent the histogram bin contents; indices represent the degree quantile labels of the vertices.
Element-wise accumulation: formally given as … .
Bidirectional Edges
Bidirectional edges: additionally accumulate … .
Feature Vectors
Feature vector: concatenate the elements of M to give a long vector of length m^4.
Strongly directed graphs: the entropy formula does not depend on … , so the dimensionality of the matrix M is reduced to 3.
PCA: perform PCA on the feature vectors (entropy component analysis).
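A sketch of the feature-vector construction in Python/NumPy/NetworkX. The per-edge entropy contribution below uses the strongly-directed term $1/(d_u^{out} d_v^{in})$ as a stand-in for the local entropic measure (an assumption, since that formula is not reproduced here), and quantile labelling follows the description above:

```python
import numpy as np
import networkx as nx

def eca_feature_vector(G, m=3):
    """m^4-length feature vector: histogram of per-edge entropy contributions,
    indexed by degree-quantile labels of (d_u^in, d_u^out, d_v^in, d_v^out)."""
    din, dout = dict(G.in_degree()), dict(G.out_degree())
    qin = np.quantile(list(din.values()), np.linspace(0, 1, m + 1)[1:-1])
    qout = np.quantile(list(dout.values()), np.linspace(0, 1, m + 1)[1:-1])
    lab = lambda q, d: int(np.searchsorted(q, d))          # quantile label in 0..m-1
    H = np.zeros((m, m, m, m))
    for u, v in G.edges():
        h = 1.0 / (dout[u] * din[v])                       # assumed local entropic measure
        H[lab(qin, din[u]), lab(qout, dout[u]), lab(qin, din[v]), lab(qout, dout[v])] += h
    return H.ravel()

# PCA over a sample of graphs (scikit-learn assumed available):
# from sklearn.decomposition import PCA
# X = np.vstack([eca_feature_vector(g) for g in graphs])
# Y = PCA(n_components=2).fit_transform(X)
```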
Ø Experiments and Discussion
Three classes of random graphs well separated, but
-­‐‑ ``small-­‐‑world"ʺ graphs and ``scale-­‐‑free"ʺ graphs show some overlap. -­‐‑ Suggests the full feature vectors are efficient in distinguishing any normal directed graphs
-­‐‑ Reduced vectors effective only for strongly directed graphs.
Observtions:
-­‐‑ classification performance is particularly good on 4-­‐‑object data
-­‐‑ on 8-­‐‑object data and 6-­‐‑class protein database, the accuracy is still acceptable. -­‐‑ all vertices have the same out-­‐‑degree 3, classification rates peak when m=3 since feature vectors preserve in and out-­‐‑degree statistics.
It is clear that our directed graph characterization is computationally tractable as the runtime does not increase rapidly even when the size of the feature vector becomes particularly large.
PCA applied to entropy feature vectors: distinct epochs of market evolution occupy different regions of the subspace and can be separated. Black Monday is a clear outlier. There appears to be some underlying manifold structure.
Fig. 4. PCA plot for directed graph embedding.
Graph Kernels
Kernel Functions
• Graphs are structures and do not reside in a Euclidean or even a metric space.
– Can we embed them in such spaces?
– Based on their similarities or features characterising their internal structure.
• What about this data: if we embed graphs, can we be sure the embedding is useful and will allow us to easily separate different classes of graph? Here the data is not separable with a linear boundary.
Graph Kernels
Random Walk Kernel (Gärtner et al., 2003)
– Count the number of matching walks between two graphs:
$$K(G_1, G_2) = \sum_{(i,j) \in V_\times} \sum_{k=0}^{\infty} \varepsilon^k \left[ A_\times^k \right]_{ij}$$
– $A_\times$ is the adjacency matrix of the product graph of $G_1$ and $G_2$.
– $k$ is the walk length.
– The number of walks becomes very large.
– The random walk graph kernel suffers from the problem of tottering, which reduces expressive power and masks structural differences.
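With geometric weights $\varepsilon^k$ the infinite sum has the closed form $(I - \varepsilon A_\times)^{-1}$; a sketch in Python/NetworkX ($\varepsilon$ must be smaller than $1/\lambda_{\max}(A_\times)$ for convergence, and the value used here is arbitrary):

```python
import numpy as np
import networkx as nx

def random_walk_kernel(G1, G2, eps=0.01):
    """Geometric random walk kernel: sum of all entries of (I - eps*Ax)^{-1},
    where Ax is the adjacency matrix of the direct (tensor) product graph."""
    Gx = nx.tensor_product(G1, G2)
    Ax = nx.to_numpy_array(Gx)
    n = Ax.shape[0]
    return float(np.linalg.inv(np.eye(n) - eps * Ax).sum())

print(random_walk_kernel(nx.cycle_graph(5), nx.path_graph(5)))
```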
Jensen-­Shannon Kernel
• Defined in terms of the Jensen-Shannon divergence:
$$K_{JS}(G_i, G_j) = \ln 2 - JS(G_i, G_j)$$
$$JS(G_i, G_j) = H(G_i \oplus G_j) - \frac{1}{2}\{ H(G_i) + H(G_j) \}$$
• Properties: extensive, positive.
Computation
• Construct the direct product graph for each graph pair.
• Compute the von Neumann entropy difference between the product graph and the two graphs individually.
• Construct the kernel matrix over all pairs.
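A sketch of the kernel in Python/NetworkX. The slides mention both $\oplus$ and a product graph for the composite structure; purely as an assumption, the composite graph is taken here as the disjoint union, and $H(\cdot)$ is the exact von Neumann entropy defined earlier:

```python
import networkx as nx
import numpy as np

def vn_entropy(G):
    lam = np.linalg.eigvalsh(nx.normalized_laplacian_matrix(G).toarray())
    p = lam / G.number_of_nodes()
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log(p)))

def js_kernel(Gi, Gj):
    """K_JS = ln 2 - JS, with JS = H(Gi (+) Gj) - (H(Gi) + H(Gj))/2."""
    Gu = nx.disjoint_union(Gi, Gj)           # assumed composite graph Gi (+) Gj
    js = vn_entropy(Gu) - 0.5 * (vn_entropy(Gi) + vn_entropy(Gj))
    return float(np.log(2.0) - js)

print(js_kernel(nx.cycle_graph(10), nx.path_graph(10)))
```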
Financial Data
Network Inference
Generative Models
• Structural domain: define a probability distribution over a prototype structure. The prototype together with the parameters of the distribution minimise the description length (Torsello and Hancock, PAMI 2007).
• Spectral domain: embed the nodes of graphs into a vector space using spectral decomposition. Construct a point distribution model over the embedded positions of the nodes (Bai, Wilson and Hancock, CVIU 2009).
Deep learning
• Deep belief networks (Hinton 2006, Bengio 2007).
• Compositional networks (Amit+Geman 1999, Fergus 2010).
• Markov models (Leonardis 200
• Stochastic image grammars (Zhu, Mumford, Yuille)
• Taxonomy/category learning (Todorovic+Ahuja, 2006-2008).
Aim
• Combine spectral and structural methods.
• Use description length criterion.
• Apply to graphs rather than trees.
Prior work
• IJCV 2007 (Torsello, Robles-Kelly, Hancock) – shape classes from edit distance using pairwise clustering.
• PAMI 06 and Pattern Recognition 05 (Wilson, Luo and Hancock) – graph clustering using spectral features and polynomials.
• PAMI 07 (Torsello and Hancock) – generative model for variations in tree structure using description length.
• CVIU 09 (Xiao, Wilson and Hancock) – generative model from heat-kernel embedding of graphs.
Structural learning using description length
Description length
• Wallace+Freeman: minimum message length.
• Rissanen: minimum description length. Use log-­posterior probability to locate model that is optimal with respect to code-­length.
Similarities/differences
• MDL: selection of the model is the aim; the model parameters are simply a means to this end. Parameters are usually maximum likelihood. The prior on parameters is flat.
• MML: recovery of the model parameters is central. The parameter prior may be more complex.
Coding scheme
• Usually assumed to follow an exponential distribution.
• Alternatives are universal codes and predictive codes.
• MML has two-part codes (model + parameters). In MDL the codes may be one- or two-part.
Method
• Model is a supergraph (i.e. graph prototype) formed by graph union.
• Sample data observation model: Bernoulli distribution over nodes and edges.
• Model complexity: von Neumann entropy of the supergraph.
• Fitting criterion:
– MDL-like: make ML estimates of the Bernoulli parameters.
– MML-like: two-part code for data-model fit + supergraph complexity.
Model overview
• Description length criterion
L(G, Γ) = LL(G | Γ) + H ( Γ)
code-length = negative log-likelihood + model code-length (entropy)
Data-­set: set of graphs G
Model: prototype graph+correspondences with it
Updates by expectation maximisation: Model graph adjacency matrix (M-­step) + correspondence indicators (E-­step).
When nodes are labelled
Sample graph with adjacency matrix D and inferred graph with adjacency matrix M. Link probability p. Correspondence indicators (node labels) are known (usually the case in complex networks):
$$L(M \mid D, p) = \prod_{i=1}^{|V|} \prod_{j=1}^{|V|} p^{D_{ij} M_{ij}} (1 - p)^{(1 - D_{ij}) M_{ij}}$$
Estimator:
$$\hat{p} = \frac{1}{|V|^2} \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} D_{ij} M_{ij}$$
Settings
• Learning a generative model: sample of graphs, unknown node labels (correspondences); estimate correspondences and the mean connection strength for each edge in a model graph.
• Network inference: sample of graphs, known node labels; estimate connection probability and connections.
Experiments
Delaunay graphs from images of different objects.
COIL dataset Toys dataset
Experiments --- validation
• COIL dataset: model complexity increases, graph data log-likelihood increases, overall code length decreases during iterations.
• Toys dataset: model complexity decreases, graph data log-likelihood increases, overall code length decreases during iterations.
Experiments --- classification task
We compare the performance of our learned supergraph on a classification task with two alternative constructions, the median graph and the supergraph learned without using MDL. The table below shows the average classification rates from 10-fold cross validation, followed by their standard errors.
Experiments --- graph embedding
Pairwise graph distance based on the Jensen-­Shannon divergence and the von Neumann entropy of graphs
Experiments --- graph embedding
Edit distance JSD distance
Generative model
• Train on graphs with set of predetermined characteristics.
• Sample using Monte Carlo.
• Reproduces characteristics of the training set, e.g. spectral gap, node degree distribution, etc.
Erdős–Rényi
Barabási–Albert (scale free)
Delaunay Graphs
Experiments --- generate new samples
Thermodynamic Depth
Idea
• Use the polytopal decomposition of the heat kernel to compute entropy.
• Find the depth at which the entropy flow with time is maximum.
• Use the thermodynamic depth of the network as a characterisation.
Heat Kernels
• Solution of the heat equation; measures information flow across the edges of the graph with time:
$$\frac{\partial h_t}{\partial t} = -L h_t, \qquad L = D - W = \Phi \Lambda \Phi^T$$
• Solution found by exponentiating the Laplacian eigensystem:
$$h_t = \sum_k \exp[-\lambda_k t]\, \phi_k \phi_k^T = \Phi \exp[-\Lambda t]\, \Phi^T$$
Heat kernel and random walk
• The state vector of a continuous-time random walk satisfies the differential equation
$$\frac{\partial p_t}{\partial t} = -L p_t$$
• Solution:
$$p_t = \exp[-Lt]\, p_0 = h_t p_0$$
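A sketch of the heat kernel and the induced random-walk state vector via the Laplacian eigensystem (Python/NumPy/NetworkX; the graph and time values are illustrative):

```python
import numpy as np
import networkx as nx

def heat_kernel(G, t):
    """h_t = Phi exp(-Lambda t) Phi^T, from L = D - W = Phi Lambda Phi^T."""
    L = nx.laplacian_matrix(G).toarray().astype(float)
    lam, Phi = np.linalg.eigh(L)
    return Phi @ np.diag(np.exp(-lam * t)) @ Phi.T

G = nx.karate_club_graph()
p0 = np.zeros(G.number_of_nodes()); p0[0] = 1.0    # walker starts at node 0
for t in (0.1, 1.0, 10.0):
    pt = heat_kernel(G, t) @ p0                    # p_t = exp(-Lt) p_0 = h_t p_0
    print(t, round(float(pt.sum()), 6))            # probability mass is conserved
```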
Example.
The graph shows the spanning tree of the heat kernel, where the weights of the graph are elements of the heat kernel. As t increases, the spanning tree evolves from a tree rooted near the centre of the graph to a string (with ligatures).
Low-t behaviour is dominated by the Laplacian; high-t behaviour is dominated by the Fiedler vector.
Polytopal Decomposition of the Heat Kernel
Decompose the heat kernel into a weighted sum of permutation matrices $P_k$:
$$K_t(G) = \sum_k p_k P_k$$
Permutohedra or Birkhoff polytopes (convex hull of permutation matrices).
Polytopal complexity
$$C_t(G) = \frac{-\sum_k p_k \ln p_k}{\ln n}$$
(the normalised Shannon entropy of the decomposition weights)
Polytopal Complexity
Phase Change
★ Temporal complexity trace: captures the interaction between heat diffusion and entropy!
Escolano, F., Hancock, E.R., Lozano, M.: Polytopal Graph Complexity, Matrix Permanent and Embedding, SSPR 2008.
Polytopal Complexity
Phase Change
★ Temporal complexity trace: a phase-transition point appears!
Escolano, F., Hancock, E.R., Lozano, M.: Polytopal Graph Complexity, Matrix Permanent and Embedding, SSPR 2008.
Polytopal Complexity
Phase Change
★ Initial experiments: analysis of several PPIs for embedding purposes.
Escolano, F., Hancock, E.R., Lozano, M.: Polytopal Graph Complexity, Matrix Permanent and Embedding, SSPR 2008.
PPI analysis
Exploiting TD for evolutionary analysis
★ Histidine kinase is a key protein involved in signal transduction across the membrane. It is key for bacterial survival.
★ Hypothesis: more evolved bacteria tend to refine and make this function more complex.
★ Method: analyse 217 PPIs of HKs from several bacterial phyla and seek a correlation between TD complexity and phylogeny.
Histidine Kinase
Aquifex aeolicus (Aquifex): TD = 57.8
Staphylococcus aureus (Gram positive): TD = 85.6
Anabaena variabilis (Cyanobacteria): TD = 4638
Acidovorax avenae (Proteobacteria): TD = 58.3
Acidobacterium sp. Ellin 345 (Acidobacteria): TD = 618.14
It's spritz time!