CONFIDENCE INTERVALS FOR THE DEGREE
DISTRIBUTION OF A GRAPH UNDER INDUCED
SUBGRAPH SAMPLING
By Robert Garrard ∗
School of Economics, University of Adelaide
We study the problem of constructing confidence intervals for
the degree distribution of a graph when degrees are sampled via induced subgraph sampling. This sampling method results in observations that are not independent, an ill-conditioned design matrix,
and noise whose covariance matrix depends on nuisance parameters.
We propose a Monte Carlo method akin to a parametric bootstrap
whereby the degree distribution is estimated using a truncated singular value decomposition and several graphs respecting the estimated
degree distribution are constructed randomly from which we may
draw pseudosamples. For each graph the relevant quantiles of the estimator are determined and confidence intervals are constructed by
taking the minimum and maximum values of the respective quantiles
over all the graphs.
∗
E-mail: [email protected]
Keywords and phrases: Induced subgraph, density estimation, inverse problem
PREFACE
Thesis Title: Estimation of Distributional Properties of Social Networks
Supervisors: Dr Firmin Doko Tchatoka, Dr Virginie Masson
Traditional neoclassical economic theory tends to view economic behavior
through the lens of centralized markets which are perfectly competitive and
obtain a market clearing price and quantity in equilibrium. In order to study
a richer set of phenomena, modern theory attempts to relax some of these
simplifying assumptions. Perfect competition is frequently relaxed through
the introduction of a continuum of monopolistically competitive intermediate goods firms or an increasing returns to scale production technology, and
the existence of a market clearing price and quantity may be replaced by
search and matching models (such as the Diamond-Mortensen-Pissarides
model of labor search). The study of social networks attempts to generalize
economic exchange away from centralized markets to interactions between
sparsely connected agents. Much theoretical headway has been made toward
understanding how the structural features of social networks can affect the
underlying behaviors, but empirical techniques for measuring these features
are yet to catch up.
My thesis studies procedures for estimating and conducting inference on
a key feature of a social network: its distribution of connections.
Chapter one is concerned with testing for stochastic dominance relationships between the degree distributions of two networks or the same network
measured over time. Under mild assumptions on a social welfare function, efficiency of one network architecture over another translates into a stochastic
dominance relationship of the degree distributions (similar to ranking income distributions). We propose a statistical test which corrects for sample
correlation and prove the validity of a bootstrap for the test statistics.
Chapter two, the paper presented here, studies how to estimate the degree
distribution under a common sampling technique, induced subgraph sampling, and proposes a construction of confidence intervals based on Monte
Carlo methods.
Chapter three addresses goodness-of-fit testing for induced subgraph samples. The design matrix associated with the sampling method is ill-conditioned,
so typical hypothesis tests have remarkably low power. We consider regularized regression methods for increasing the power of hypothesis tests while
maintaining the advertized size.
1. Introduction. Graphs are an increasingly popular tool used to model
complex relationships and interactions, such as the spread of contagion
through a financial system (Nier et al. 2007, Gai and Kapadia 2010), risk
sharing (Bramoulle and Kranton 2007), trade in the absence of markets
(Kranton and Minehart 2001), and even the interaction between proteins
within a cell (Pellegrini et al. 2004).¹ One of the many salient features of a
graph is its degree distribution, which captures the pattern of direct connections between nodes. Albert et al. (2000) show that the degree distribution
is strongly tied to the ability of a graph to withstand failures of some of
its components. Doyle et al. (2005) coined the term “robust-yet-fragile” to
refer to graphs that are robust to random failures but vulnerable to targeted
attacks or failures of certain “key” components. This feature is common to
many real-world graphs such as the webgraph of the internet or interbank
lending networks in a financial system (Boss et al. 2004, Gai et al. 2011).
Since graphs are often too large to observe in their entirety, inferences
about their features must be made from sampled subgraphs. One such sampling method, induced subgraph sampling, involves taking a simple random
sample of nodes and observing only the connections between those nodes
sampled. This yields a sampled subgraph for which we may compute features of interest, such as measures of centrality, clustering, etc. However, this
sampling design distorts the degree distribution by systematically omitting
links from the sample. In this paper we study how to estimate the density
of a graph’s degree distribution based on an induced subgraph sample and
propose a Monte Carlo method for constructing a confidence region.
Let G = (V, E) be a graph and let V′ ⊆ V and E′ ⊆ E be subsets of its
nodes and edges. G′ = (V′, E′) is said to be an induced subgraph when any
pair of nodes a, b ∈ V′ are adjacent in G′ if and only if they are adjacent
in G. An induced subgraph sample is formed by simple random sampling
a subset of nodes and constructing the subgraph induced by those nodes.
Since only connections to other nodes in the sample are measured, a node’s
degree in the sampled subgraph will be biased downward from its degree
in the population graph. Furthermore, despite simple random sampling of
the nodes, the observed degrees in the induced subgraph do not constitute
independent draws. This takes kernel density estimation of the degree distribution of G off the table. Not only would it be estimating a distorted
version of the distribution, but the failure of independence would prevent
cross-validation from selecting the optimal bandwidth.
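The downward bias described above can be seen in a minimal sketch of the sampling design. It assumes the graph is stored as a dense 0/1 NumPy adjacency matrix; the function name `induced_subgraph_degrees` and the small Erdős-Rényi test graph are illustrative choices, not taken from the paper.

```python
import numpy as np

def induced_subgraph_degrees(adj, n, rng):
    """Sample n nodes without replacement; return their degrees in G and in G'."""
    N = adj.shape[0]
    nodes = rng.choice(N, size=n, replace=False)   # simple random sample of nodes
    sub = adj[np.ix_(nodes, nodes)]                # keep only links among sampled nodes
    return adj[nodes].sum(axis=1), sub.sum(axis=1)

rng = np.random.default_rng(0)
N, p = 200, 0.05
upper = np.triu(rng.random((N, N)) < p, k=1)       # Erdős-Rényi links above the diagonal
adj = (upper | upper.T).astype(int)                # symmetric, no self-loops
deg_pop, deg_sub = induced_subgraph_degrees(adj, n=50, rng=rng)
print(deg_pop.mean(), deg_sub.mean())              # sample degrees are never larger
```

Every sampled node's degree in the induced subgraph is at most its degree in the population graph, which is the distortion the estimator has to undo.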
¹ The terms graph and network will be used interchangeably.

Fig 1: Estimate of degree distribution using simple inversion. Sample of size n = 2000 drawn from an Erdős-Rényi random graph on N = 10,000 nodes with probability parameter such that Np = 7. Panels: (a) True Distribution; (b) Simple Estimator.

Suppose we draw an induced subgraph sample G′ from a graph G. Let
N = |V| and n = |V′| be the number of nodes in the population and sample
respectively. Let β ∈ ℝ^N represent the degree distribution of G such that β_i
is the proportion of nodes in G with degree i = 0, . . . , N − 1, and let y ∈ ℝ^n
be defined similarly for G′. Frank (1980) shows that
\[ E[y] = X\beta \tag{1} \]
where
\[ X_{ij} = \binom{N-1}{n-1}^{-1} \binom{N-1-j}{n-1-i} \binom{j}{i} \tag{2} \]
The system in (1) is under-determined since N ≫ n. Supposing there is a
known maximum degree in G such that only the first M elements of β are
possibly non-zero, the system reduces to an inverse problem. Defining ỹ to be
the first M elements of y and X̃ to be the leading M × M block of X, a natural
unbiased estimator of β is
\[ \hat{\beta} = \tilde{X}^{-1} \tilde{y} \tag{3} \]
However, as shown in Zhang et al. (2015), the matrix X̃ is ill-conditioned
with rapidly decaying singular values. This results in an unstable inversion
when y is measured with noise. Figure 1 illustrates the undesirable nature
of this estimator.
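To make the ill-conditioning concrete, the following sketch builds the truncated design matrix from equation (2) and reports its condition number. The helper name `design_matrix` and the values N = 1000, n = 200, M = 15 are assumptions chosen for illustration, and the code assumes M ≤ n − 1 so that every binomial coefficient is well defined.

```python
from math import comb
import numpy as np

def design_matrix(N, n, M):
    """(M+1) x (M+1) matrix with X[i, j] = P(sample degree i | population degree j)."""
    denom = comb(N - 1, n - 1)
    X = np.zeros((M + 1, M + 1))
    for j in range(M + 1):
        for i in range(j + 1):  # a sampled degree cannot exceed the true degree
            X[i, j] = comb(j, i) * comb(N - 1 - j, n - 1 - i) / denom
    return X

X = design_matrix(N=1000, n=200, M=15)
print(X.sum(axis=0))       # each column is a hypergeometric pmf, so it sums to one
print(np.linalg.cond(X))   # very large: small noise in y is hugely amplified
```

The matrix is upper triangular with diagonal entries that shrink roughly like (n/N)^j, which is one way to see why its singular values decay so quickly.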
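Zhang et al. (2015) derive conditions under which the distribution of noise
from sampling error may be considered approximately normal and propose
an estimator which is the solution to the following quadratic programming
problem.
\[
\begin{aligned}
\underset{\beta}{\text{minimize}} \quad & (\tilde{y} - \tilde{X}\beta)' C^{-1} (\tilde{y} - \tilde{X}\beta) + \lambda \|D\beta\|_2^2 \\
\text{subject to} \quad & \beta \geq 0, \quad \|\beta\|_1 = 1
\end{aligned}
\tag{4}
\]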
where C is an estimator of the covariance matrix of the noise, D is a second-differencing operator to impose smoothness, and λ is chosen through Monte
Carlo SURE. While this enforces desired behavior of the estimator by constraint, its properties become difficult to study due to the requirement that
both the estimator and the tuning parameter be determined numerically.
Furthermore, the impact of the ℓ1 constraint is unclear. By the very nature
of the problem, X̃ has eigenvalues quickly decaying to zero, violating many
of the typical restricted eigenvalue-type conditions for estimation with an ℓ1
penalty, such as in Greenshtein and Ritov (2004), Fuchs (2005), Donoho (2006),
and Zhao and Yu (2006).
The rest of the paper proceeds as follows. Section 2 discusses estimation
of the degree distribution. Section 3 proposes a Monte Carlo method for
constructing confidence intervals. Section 4 concludes with a discussion.
2. Estimating the Degree Distribution. Let G be a graph on N
nodes with an unknown degree distribution. Let G′ be an induced subgraph
formed by simple random sampling n nodes from G. Following Frank (1980)
and Zhang et al. (2015) we will assume G has a known maximum degree
M ≤ n. Let β denote the degree distribution of G, where β_i is the proportion
of nodes with degree i for i = 0, . . . , M. Similarly, let y denote the empirical
degree distribution of the sampled subgraph, where y_i is the proportion of
nodes in G′ with degree i for i = 0, . . . , M. Let X be an (M + 1) × (M + 1)
matrix with ij-entry as in equation (2), for i, j = 0, . . . , M. The vector y is generated
according to the linear model with fixed design
\[ y = X\beta + \varepsilon \tag{5} \]
where ε is noise introduced by sampling error with mean 0 and covariance
matrix Ω. Frank (1980) shows that Ω exhibits both heteroscedasticity and
non-zero off-diagonal elements, each of which depends on higher order properties of the graph G. Thus, Ω is not in general consistently estimable due
to its dependence on nuisance parameters, making GLS estimation of β
infeasible.
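The mean-zero property of the noise in the linear model can be checked by simulation. The sketch below is an assumption-laden illustration rather than part of the original analysis: it fixes one small Erdős-Rényi population graph (N = 30, an arbitrary choice), repeatedly draws induced subgraph samples of n = 10 nodes, and compares the average empirical degree distribution with Xβ.

```python
from math import comb
import numpy as np

def design_matrix(N, n, M):
    """X[i, j] from equation (2); rows with i > n - 1 stay zero."""
    X = np.zeros((M + 1, M + 1))
    for j in range(M + 1):
        for i in range(min(j, n - 1) + 1):
            X[i, j] = comb(j, i) * comb(N - 1 - j, n - 1 - i) / comb(N - 1, n - 1)
    return X

rng = np.random.default_rng(1)
N, n, reps = 30, 10, 5000

# One fixed population graph (hypothetical test case).
upper = np.triu(rng.random((N, N)) < 0.15, k=1)
adj = (upper | upper.T).astype(int)
deg = adj.sum(axis=1)
M = int(deg.max())
beta = np.bincount(deg, minlength=M + 1) / N      # population degree distribution
X = design_matrix(N, n, M)

# Average the sample degree distribution over many induced subgraph samples.
y_sum = np.zeros(M + 1)
for _ in range(reps):
    nodes = rng.choice(N, size=n, replace=False)
    sub_deg = adj[np.ix_(nodes, nodes)].sum(axis=1)
    y_sum += np.bincount(sub_deg, minlength=M + 1) / n
y_bar = y_sum / reps

print(np.abs(y_bar - X @ beta).max())             # small, up to Monte Carlo error
```

The check concerns only the mean of y; it says nothing about the covariance matrix Ω, which is the part that depends on nuisance parameters.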
Fig 2: Estimate of degree distribution in figure 1 using truncated SVD with sample size n = 2000. Panels: (a) True Distribution; (b) Truncated SVD Estimator.
Estimation of β requires a solution to the above ill-posed inverse problem
for which we will employ a truncated singular value decomposition (SVD).
The design matrix may be decomposed into
\[ X = U \Sigma V' \tag{6} \]
where U and V are (M + 1) × (M + 1) orthogonal matrices and Σ is a
diagonal matrix whose entries σ_0 ≥ σ_1 ≥ · · · ≥ σ_M ≥ 0 are the singular
values of X. The inverse of X may be written as
\[ X^{-1} = V \Sigma^{-1} U' \tag{7} \]
Since X is ill-conditioned, some of its singular values are very close to zero.
This leads to extremely large elements in Σ^{−1}, inflating the effect of noise
and creating an unstable inversion. An approximate inverse which is well
conditioned may be found by applying the following truncation.
\[
X^{\dagger} = V \Sigma^{\dagger} U', \qquad
\Sigma^{\dagger}_{ii} =
\begin{cases}
1/\sigma_{ii} & \text{if } \sigma_{ii} \geq t \\
0 & \text{otherwise}
\end{cases}
\tag{8}
\]
Ideally, the choice of the bandwidth t would be data driven. However, the
dependence within the sample removes the possibility of choosing t through
cross-validation or bootstrap methods and the properties of the noise violate
the assumptions for SURE. Thus we choose t heuristically. Noting that we
obtain a consistent estimator only if t → 0, we have found reasonable success
choosing t = n^{−1/2} for sampling rates less than 20% and otherwise choosing
t = n^{−1/k} where k is the tens digit of the sampling rate (e.g., k = 3 for 30%).
We employ the following consistent estimator.
\[ \hat{\beta} = \left( X^{\dagger} y \right)_+ \tag{9} \]
where (x)_+ := max{x, 0} imposes non-negativity. Figure 2 illustrates the
SVD estimator for a sample taken from an Erdős-Rényi random graph.
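The truncation, the heuristic threshold and the non-negativity step can be sketched as follows. The function names `threshold`, `truncated_pinv` and `svd_estimator` are hypothetical, the threshold rule is our reading of the heuristic in the text, and the 3 × 3 diagonal matrix is a toy stand-in for the design matrix used only to show the truncation at work.

```python
import numpy as np

def threshold(n, N):
    """Heuristic t: n^(-1/2) below a 20% sampling rate, otherwise n^(-1/k)."""
    pct = (100 * n) // N                 # sampling rate in whole percent
    if pct < 20:
        return n ** -0.5
    return n ** (-1.0 / (pct // 10))     # k is the tens digit of the rate

def truncated_pinv(X, t):
    """X^dagger = V Sigma^dagger U', inverting only singular values >= t."""
    U, s, Vt = np.linalg.svd(X)
    s_inv = np.zeros_like(s)
    keep = s >= t
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ np.diag(s_inv) @ U.T

def svd_estimator(X_pinv, y):
    """Apply the truncated inverse, then set negative entries to zero."""
    return np.maximum(X_pinv @ y, 0.0)

X = np.diag([1.0, 0.5, 1e-8])            # toy ill-conditioned matrix
y = np.array([0.2, 0.3, 1e-6])
X_pinv = truncated_pinv(X, t=0.1)
print(svd_estimator(X_pinv, y))          # the third direction is discarded
print(threshold(3000, 10000))            # 30% sampling rate, so k = 3
```

Because the smallest singular value falls below t, its direction is dropped rather than multiplied by 10^8, which is how the truncation keeps the noise from dominating the estimate.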
3. Monte Carlo Confidence Intervals. Suppose we wish to construct a 1 − α simultaneous confidence region for β. Placing confidence intervals around our density estimate typically requires either knowledge of the
sampling distribution of the estimator together with a consistent estimator
of the standard error, or the use of the non-parametric bootstrap. Neither is
possible here since the standard errors depend on nuisance parameters and
the observations are not independent. We propose a Monte Carlo method
for constructing confidence intervals akin to a parametric bootstrap.
Let β̂ be a consistent estimator of β. Let 𝒢 be the set of all graphs on
||β̂||_1 nodes such that each g ∈ 𝒢 has degree distribution β̂.
Algorithm 1 Confidence Bands
for i = 1 to b do
    Sample a graph g_i ∈ 𝒢 uniformly at random
    for j = 1 to c do
        Sample an induced subgraph, h_j, on n nodes from g_i
        Determine the empirical degree distribution, y*_j, of h_j
        Let β̂*_j = (X† y*_j)_+
    end for
    Let L_i and U_i be the α/2 and 1 − α/2 quantiles of β̂* respectively
end for
Let L = min{L_i} and U = max{U_i}
where the quantiles, minima, and maxima are understood componentwise. Here the reason for choosing a truncated SVD estimator becomes
apparent. The pseudoinverse X† may be precalculated such that estimation
of β̂* reduces to matrix multiplication. A constrained least squares approach
would require the solution of a quadratic programming problem at each iteration.
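Algorithm 1 can be sketched end to end in a few dozen lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the population size N is taken as known, N·β̂ is rounded to an integer degree sequence, graphs are drawn by random stub matching, and self-loops and repeated links are collapsed in the adjacency matrix. All function names are hypothetical.

```python
from math import comb
import numpy as np

def design_matrix(N, n, M):
    X = np.zeros((M + 1, M + 1))
    for j in range(M + 1):
        for i in range(min(j, n - 1) + 1):
            X[i, j] = comb(j, i) * comb(N - 1 - j, n - 1 - i) / comb(N - 1, n - 1)
    return X

def truncated_pinv(X, t):
    U, s, Vt = np.linalg.svd(X)
    s_inv = np.where(s >= t, 1.0 / np.maximum(s, t), 0.0)
    return Vt.T @ np.diag(s_inv) @ U.T

def degree_sequence(beta_hat, N):
    """Assumption of this sketch: round N * beta_hat to node counts per degree."""
    counts = np.floor(beta_hat / beta_hat.sum() * N + 0.5).astype(int)
    return np.repeat(np.arange(len(beta_hat)), counts)

def configuration_graph(degrees, rng):
    """Random stub matching; loops and repeated links are collapsed."""
    stubs = rng.permutation(np.repeat(np.arange(len(degrees)), degrees))
    stubs = stubs[: len(stubs) - len(stubs) % 2]   # drop one stub if the total is odd
    adj = np.zeros((len(degrees), len(degrees)), dtype=int)
    a, b = stubs[0::2], stubs[1::2]
    adj[a, b] = 1
    adj[b, a] = 1
    np.fill_diagonal(adj, 0)
    return adj

def confidence_bands(beta_hat, N, n, t, b, c, alpha, rng):
    M = len(beta_hat) - 1
    X_pinv = truncated_pinv(design_matrix(N, n, M), t)   # computed once, reused
    degrees = degree_sequence(beta_hat, N)
    lowers, uppers = [], []
    for _ in range(b):
        adj = configuration_graph(degrees, rng)
        draws = np.empty((c, M + 1))
        for j in range(c):
            nodes = rng.choice(len(degrees), size=n, replace=False)
            sub_deg = adj[np.ix_(nodes, nodes)].sum(axis=1)
            y_star = np.bincount(sub_deg, minlength=M + 1)[: M + 1] / n
            draws[j] = np.maximum(X_pinv @ y_star, 0.0)
        lowers.append(np.quantile(draws, alpha / 2, axis=0))
        uppers.append(np.quantile(draws, 1 - alpha / 2, axis=0))
    return np.min(lowers, axis=0), np.max(uppers, axis=0)

rng = np.random.default_rng(0)
beta_hat = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])
L, U = confidence_bands(beta_hat, N=200, n=50, t=50 ** -0.5,
                        b=5, c=20, alpha=0.05, rng=rng)
print(np.column_stack([L, U]))
```

The inner loop involves only a bincount and one matrix-vector product, which reflects the computational argument for the truncated SVD estimator made in the text.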
This approach is similar to a parametric bootstrap because we attempt
to mimic the sampling distribution of β̂ by first estimating the parameter
and then drawing pseudosamples from a DGP indexed by that parameter.
However, the DGP is not uniquely pinned down by the degree distribution alone; it also depends on higher order properties of the graph. We therefore sample
uniformly over all possible graphs which have that degree distribution and in
effect consider the worst case scenario. This should lead to confidence intervals which are asymptotically conservative. This construction requires the
ability to sample uniformly over the set of graphs with a given degree distribution. This can be achieved quite easily with the so-called configuration
model (Bender and Canfield 1978, Bollobás 2001), which reduces the problem of randomly generating a graph with a prescribed degree distribution
to that of creating a random matching.
3.1. Sampling Graphs with a Prescribed Degree Distribution. Let d_1, . . . , d_N
be the set of degrees of each node implied by the degree distribution.² We
wish to construct a graph with this degree sequence. Begin with a set of
nodes, i = 1, . . . , N. Endow each node i with a set of d_i stubs (or half-links)
emanating from the node. Now construct a random matching on the set of
stubs, and connect each pair of stubs that are matched to form a link.
To randomly match the stubs, create a list of the node labels in which
label i appears d_i times, then form a random permutation of the list. To
construct the graph, start at the first entry in the permuted list and pair off
consecutive entries (first with second, third with fourth, and so on), joining the
corresponding nodes by a link.
Example 1. Suppose we wish to construct a graph with degree sequence
4, 2, 2, 1, 1.
Fig 3: Nodes with 4, 2, 2, 1, and 1 stubs respectively.
Construct the list: 1111223345. Produce a random permutation: 5113212314.
Connect adjacent nodes.

Fig 4: (a) Connect the stubs. (b) Resulting Graph.

² Which is unique up to a relabeling of the nodes.
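The pairing step of Example 1 can be traced in a few lines, assuming that "connect adjacent nodes" means pairing the first entry of the permuted list with the second, the third with the fourth, and so on. The variable names are illustrative.

```python
from collections import Counter

permutation = "5113212314"                     # the random permutation from Example 1
labels = [int(ch) for ch in permutation]
edges = list(zip(labels[0::2], labels[1::2]))  # pair off consecutive entries
degree = Counter(node for edge in edges for node in edge)

print(edges)                   # [(5, 1), (1, 3), (2, 1), (2, 3), (1, 4)]
print(sorted(degree.items()))  # [(1, 4), (2, 2), (3, 2), (4, 1), (5, 1)]
```

The resulting links reproduce the degree sequence 4, 2, 2, 1, 1, and in this particular draw the graph has no self-loops or repeated links.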
Given a degree sequence, d_1, . . . , d_m, we randomly sample the adjacency
matrix of a graph g ∈ 𝒢 on m nodes as follows.
Algorithm 2 Sample a Graph
Let S be a vector of stub labels
G ← zeros(m, m)
P ← randperm(S)
for i = 1, 3, 5, . . . , length(P) − 1 do
    G(P(i), P(i + 1)) ← 1
    G(P(i + 1), P(i)) ← 1
end for
This yields an adjacency matrix G from which we may sample induced
subgraphs. Under this sampling method it’s possible for the sampled graph
to have self-loops and multiple edges between nodes, though the probability
of this is declining with the number of nodes. It’s feasible to only accept
samples which are simple graphs, but for the sake of computational speed
we will not do so.
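For readers who do want only simple graphs, the rejection step mentioned above might look like the following sketch. It is not the procedure used in the paper, which keeps non-simple draws for speed; the function names and the cap `max_tries` are assumptions, and nodes are labeled from 0 rather than 1.

```python
import numpy as np

def stub_matching(degrees, rng):
    """Return the list of links from one random matching of stubs."""
    stubs = rng.permutation(np.repeat(np.arange(len(degrees)), degrees))
    return list(zip(stubs[0::2].tolist(), stubs[1::2].tolist()))

def is_simple(edges):
    """True when there are no self-loops and no repeated links."""
    distinct = {frozenset(e) for e in edges}
    return all(a != b for a, b in edges) and len(distinct) == len(edges)

def sample_simple_graph(degrees, rng, max_tries=1000):
    """Rejection sampling: redraw the matching until the graph is simple."""
    for _ in range(max_tries):
        edges = stub_matching(degrees, rng)
        if is_simple(edges):
            return edges
    raise RuntimeError("no simple graph found; the sequence may not be graphical")

rng = np.random.default_rng(0)
degrees = np.array([4, 2, 2, 1, 1])      # degree sequence from Example 1
edges = sample_simple_graph(degrees, rng)
print(edges)
```

For this degree sequence the simple graph is unique, since the degree-4 node must link to every other node, so only the order and orientation of the printed links vary between runs. Rejection becomes expensive when degrees are large relative to the number of nodes, which is consistent with the paper's choice to skip it.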
Figure 5 illustrates confidence regions generated by this procedure for
various sampling rates. Sampling rates for graphs may be as high as 50%
(Jackson et al. 2012, Banerjee et al. 2012). Due to the way in which the
confidence intervals are generated, they need not be centered around the
density estimate. However, since the threshold parameter for the SVD estimator increases with the sampling rate, the confidence bounds automatically
become smoother and more tightly concentrated around the estimate.
Fig 5: 95% Confidence bands with b = 100 and c = 100. M = 15. Panels: (a) Sampling Rate: 10%; (b) Sampling Rate: 20%; (c) Sampling Rate: 30%; (d) Sampling Rate: 50%.

4. Discussion. In this paper we have proposed a method for constructing confidence intervals around a density estimate of a graph’s degree distribution based on Monte Carlo construction of graphs whose degree distributions agree with the estimate. For these confidence intervals to attain their
advertized coverage in finite samples, it’s necessary for the density estimator
to have good finite sample estimation risk. Zhang et al. (2015) propose an
estimator that attempts to minimize estimation risk both through constraint
(enforcing non-negativity and a known ℓ1 norm) and Monte Carlo SURE.
However, the combination of estimation through quadratic programming,
determining the tuning parameter through Monte Carlo SURE, and forming confidence intervals with Monte Carlo construction of graphs proves to
be a computational burden. Consequently we rely on a simpler estimation
method based on a truncated singular value decomposition where the bandwidth is chosen heuristically. A data driven choice for this bandwidth given
a dependent sample is a possible avenue for further research.
References.
Albert, R., Jeong, H., and Barabási, A.-L. (2000). Error and attack tolerance of complex
networks. Nature, 406(6794):378–382.
Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2012). The diffusion
of microfinance. NBER Working Papers 17743, National Bureau of Economic Research,
Inc.
Bender, E. A. and Canfield, E. (1978). The asymptotic number of labeled graphs with
given degree sequences. Journal of Combinatorial Theory, Series A, 24(3):296–307.
Bollobás, B. (2001). Random Graphs. Cambridge University Press, Cambridge, 2nd edition.
Boss, M., Elsinger, H., Summer, M., and Thurner, S. (2004). Network topology of the interbank
market. Quantitative Finance, 4(6):677–684.
Bramoulle, Y. and Kranton, R. (2007). Risk-sharing networks. Journal of Economic
Behavior & Organization, 64(3-4):275–294.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory,
52(4):1289–1306.
Doyle, J. C., Alderson, D. L., Li, L., Low, S., Roughan, M., Shalunov, S., Tanaka, R., and
Willinger, W. (2005). The robust yet fragile nature of the internet. Proceedings of the
National Academy of Sciences of the United States of America, 102(41):14497–14502.
Frank, O. (1980). Estimation of the number of vertices of different degrees in a graph.
Journal of Statistical Planning and Inference, 4(1):45–50.
Fuchs, J.-J. (2005). Recovery of exact sparse representations in the presence of bounded
noise. IEEE Transactions on Information Theory, 51(10):3601–3608.
Gai, P., Haldane, A., and Kapadia, S. (2011). Complexity, concentration and contagion.
Journal of Monetary Economics, 58(5):453–470. Carnegie-Rochester Conference on
Public Policy: Normalizing Central Bank Practice in Light of the Credit Turmoil, 12–13
November 2010.
Gai, P. and Kapadia, S. (2010). Contagion in financial networks. Proceedings of the Royal
Society of London A: Mathematical, Physical and Engineering Sciences.
Greenshtein, E. and Ritov, Y. (2004). Persistence in high-dimensional linear predictor
selection and the virtue of overparametrization. Bernoulli, 10(6):971–988.
Jackson, M. O., Rodriguez-Barraquer, T., and Tan, X. (2012). Social capital and social
quilts: Network patterns of favor exchange. American Economic Review, 102(5):1857–
97.
Kranton, R. E. and Minehart, D. F. (2001). A Theory of Buyer-Seller Networks. American
Economic Review, 91(3):485–508.
Nier, E., Yang, J., Yorulmazer, T., and Alentorn, A. (2007). Network models and financial stability. Journal of Economic Dynamics and Control, 31(6):2033–2060. Tenth
Workshop on Economic Heterogeneous Interacting Agents (WEHIA 2005).
Pellegrini, M., Haynor, D., and Johnson, J. M. (2004). Protein interaction networks.
Expert Review of Proteomics, 1(2):239–249.
Zhang, Y., Kolaczyk, E. D., and Spencer, B. D. (2015). Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social
media networks. Ann. Appl. Stat., 9(1):166–199.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine
Learning Research, 7(Nov):2541–2563.