CONFIDENCE INTERVALS FOR THE DEGREE DISTRIBUTION OF A GRAPH UNDER INDUCED SUBGRAPH SAMPLING

By Robert Garrard ∗
School of Economics, University of Adelaide

We study the problem of constructing confidence intervals for the degree distribution of a graph when degrees are sampled via induced subgraph sampling. This sampling method results in observations that are not independent, an ill-conditioned design matrix, and noise whose covariance matrix depends on nuisance parameters. We propose a Monte Carlo method akin to a parametric bootstrap: the degree distribution is estimated using a truncated singular value decomposition, and several graphs respecting the estimated degree distribution are constructed at random, from which we draw pseudosamples. For each graph the relevant quantiles of the estimator are determined, and confidence intervals are constructed by taking the minimum and maximum values of the respective quantiles over all the graphs.

∗ E-mail: [email protected]
Keywords and phrases: Induced subgraph, density estimation, inverse problem

PREFACE

Thesis Title: Estimation of Distributional Properties of Social Networks
Supervisors: Dr Firmin Doko Tchatoka, Dr Virginie Masson

Traditional neoclassical economic theory tends to view economic behavior through the lens of centralized markets which are perfectly competitive and obtain a market clearing price and quantity in equilibrium. In order to study a richer set of phenomena, modern theory attempts to relax some of these simplifying assumptions. Perfect competition is frequently relaxed through the introduction of a continuum of monopolistically competitive intermediate goods firms or an increasing returns to scale production technology, and the existence of a market clearing price and quantity may be replaced by search and matching models (such as the Diamond-Mortensen-Pissarides model of labor search).
The study of social networks attempts to generalize economic exchange away from centralized markets to interactions between sparsely connected agents. Much theoretical headway has been made toward understanding how the structural features of social networks can affect the underlying behaviors, but empirical techniques for measuring these features are yet to catch up. My thesis studies procedures for estimating and conducting inference on a key feature of a social network: its distribution of connections.

Chapter one is concerned with testing for stochastic dominance relationships between the degree distributions of two networks or the same network measured over time. Under mild assumptions on a social welfare function, efficiency of one network architecture over another translates into a stochastic dominance relationship of the degree distributions (similar to ranking income distributions). We propose a statistical test which corrects for sample correlation and prove the validity of a bootstrap for the test statistics.

Chapter two, the paper presented here, studies how to estimate the degree distribution under a common sampling technique, induced subgraph sampling, and proposes a construction of confidence intervals based on Monte Carlo methods.

Chapter three addresses goodness-of-fit testing for induced subgraph samples. The design matrix associated with the sampling method is ill-conditioned, so typical hypothesis tests have remarkably low power. We consider regularized regression methods for increasing the power of hypothesis tests while maintaining the advertised size.

1. Introduction.

Graphs are an increasingly popular tool used to model complex relationships and interactions, such as the spread of contagion through a financial system (Nier et al. 2007, Gai and Kapadia 2010), risk sharing (Bramoulle and Kranton 2007), trade in the absence of markets (Kranton and Minehart 2001), and even the interaction between proteins within a cell (Pellegrini et al.
2004).¹ One of the many salient features of a graph is its degree distribution, which captures the pattern of direct connections between nodes. Albert et al. (2000) show that the degree distribution is strongly tied to the ability of a graph to withstand failures of some of its components. Doyle et al. (2005) coined the term “robust-yet-fragile” to refer to graphs that are robust to random failures but vulnerable to targeted attacks or failures of certain “key” components. This feature is common to many real-world graphs such as the webgraph of the internet or interbank lending networks in a financial system (Boss et al. 2004, Gai et al. 2011).

Since graphs are often too large to observe in their entirety, inferences about their features must be made from sampled subgraphs. One such sampling method, induced subgraph sampling, involves taking a simple random sample of nodes and observing only the connections between the sampled nodes. This yields a sampled subgraph for which we may compute features of interest, such as measures of centrality, clustering, etc. However, this sampling design distorts the degree distribution by systematically omitting links from the sample. In this paper we study how to estimate the density of a graph’s degree distribution based on an induced subgraph sample and propose a Monte Carlo method for constructing a confidence region.

Let G = (V, E) be a graph and let V′ ⊆ V and E′ ⊆ E be subsets of nodes and edges. G′ = (V′, E′) is said to be an induced subgraph when any pair of nodes a, b ∈ V′ is adjacent in G′ if and only if it is adjacent in G. An induced subgraph sample is formed by taking a simple random sample of nodes and constructing the subgraph induced by those nodes. Since only connections to other nodes in the sample are measured, a node’s degree in the sampled subgraph will be biased downward from its degree in the population graph.
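As a rough illustration of this downward bias, the following Python sketch draws an induced subgraph sample from a small Erdős-Rényi graph and compares each sampled node's degree in the population graph with its degree in the sample. The function name, the graph size, and the edge probability are illustrative assumptions, not values used in the paper.

```python
import numpy as np

def induced_subgraph_degrees(adj, n, rng):
    """Return (population degrees, induced degrees) for a simple random sample of n nodes."""
    N = adj.shape[0]
    sample = rng.choice(N, size=n, replace=False)  # simple random sample of nodes
    sub = adj[np.ix_(sample, sample)]              # keep only links among sampled nodes
    return adj[sample].sum(axis=1), sub.sum(axis=1)

rng = np.random.default_rng(0)
N, n, p = 400, 100, 0.05                           # illustrative sizes only
upper = np.triu(rng.random((N, N)) < p, k=1)       # random links above the diagonal
adj = (upper | upper.T).astype(int)                # symmetric adjacency, no self-loops

true_deg, sub_deg = induced_subgraph_degrees(adj, n, rng)
print("mean degree in population graph:", true_deg.mean())
print("mean degree in induced subgraph:", sub_deg.mean())
```

Every induced degree is at most the corresponding population degree, because links to unsampled nodes are dropped.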
Furthermore, despite simple random sampling of the nodes, the observed degrees in the induced subgraph do not constitute independent draws. This rules out kernel density estimation of the degree distribution of G. Not only would it be estimating a distorted version of the distribution, but the failure of independence would prevent cross-validation from selecting the optimal bandwidth.

Suppose we draw an induced subgraph sample G′ from a graph G. Let N = |V| and n = |V′| be the number of nodes in the population and sample respectively. Let β ∈ ℝ^N represent the degree distribution of G such that β_i is the proportion of nodes in G with degree i = 0, . . . , N − 1, and let y ∈ ℝ^n be defined similarly for G′. Frank (1980) shows that

\[ \mathbb{E}[y] = X\beta \tag{1} \]

where

\[ X_{ij} = \binom{N-1}{n-1}^{-1} \binom{N-1-j}{n-1-i} \binom{j}{i} \tag{2} \]

¹ The terms graph and network will be used interchangeably.

[Fig 1: Estimate of degree distribution using simple inversion. Panels: (a) True Distribution, (b) Simple Estimator. Sample of size n = 2000 drawn from an Erdős-Rényi random graph on N = 10,000 nodes with probability parameter such that Np = 7.]

The system in (1) is under-determined since N ≫ n. Supposing there is a known maximum degree in G such that only the first M elements of β are possibly non-zero, the system reduces to an inverse problem. Defining ỹ and X̃ to be the first M rows and columns of y and X respectively, a natural unbiased estimator of β is

\[ \hat{\beta} = \tilde{X}^{-1} \tilde{y} \tag{3} \]

However, as shown in Zhang et al. (2015), the matrix X̃ is ill-conditioned with rapidly decaying singular values. This results in an unstable inversion when y is measured with noise. Figure 1 illustrates the undesirable nature of this estimator.

Zhang et al. (2015) derive conditions under which the distribution of noise from sampling error may be considered approximately normal and propose an estimator which is the solution to the following quadratic programming problem.
\[ \min_{\beta} \; (\tilde{y} - \tilde{X}\beta)' C^{-1} (\tilde{y} - \tilde{X}\beta) + \lambda \|D\beta\|_2^2 \quad \text{subject to} \quad \beta \ge 0, \; \|\beta\|_1 = 1 \tag{4} \]

where C is an estimator of the covariance matrix of the noise, D is a second-differencing operator to impose smoothness, and λ is chosen through Monte Carlo SURE. While this enforces desired behavior of the estimator by constraint, its properties become difficult to study because both the estimator and the tuning parameter must be determined numerically. Furthermore, the impact of the ℓ1 constraint is unclear. By the very nature of the problem, X̃ has eigenvalues quickly decaying to zero, violating many of the typical restricted eigenvalue-type conditions for estimation with an ℓ1 penalty, such as in Greenshtein et al. (2004), Fuchs (2005), Donoho (2006), and Zhao and Yu (2006).

The rest of the paper proceeds as follows. Section 2 discusses estimation of the degree distribution. Section 3 proposes a Monte Carlo method for constructing confidence intervals. Section 4 concludes with a discussion.

2. Estimating the Degree Distribution.

Let G be a graph on N nodes with an unknown degree distribution. Let G′ be an induced subgraph formed by simple random sampling of n nodes from G. Following Frank (1980) and Zhang et al. (2015) we will assume G has a known maximum degree M ≤ n. Let β denote the degree distribution of G, where β_i is the proportion of nodes with degree i for i = 0, . . . , M. Similarly, let y denote the empirical degree distribution of the sampled subgraph, where y_i is the proportion of nodes in G′ with degree i for i = 0, . . . , M. Let X be an (M + 1) × (M + 1) matrix with ij-entry as in equation (2), for i, j = 0, . . . , M. y is generated according to the linear model with fixed design

\[ y = X\beta + \varepsilon \tag{5} \]

where ε is noise introduced by sampling error with mean 0 and covariance matrix Ω. Frank (1980) shows that Ω exhibits both heteroscedasticity and non-zero off-diagonal elements, each of which depends on higher order properties of the graph G.
Thus, Ω is not in general consistently estimable due to its dependence on nuisance parameters, making GLS estimation of β infeasible.

[Fig 2: Estimate of degree distribution in figure 1 using truncated SVD with sample size n = 2000. Panels: (a) True Distribution, (b) Truncated SVD Estimator.]

Estimation of β requires a solution to the above ill-posed inverse problem, for which we will employ a truncated singular value decomposition (SVD). The design matrix may be decomposed into

\[ X = U \Sigma V' \tag{6} \]

where U and V are (M + 1) × (M + 1) orthogonal matrices and Σ is a diagonal matrix whose entries σ_0 ≥ σ_1 ≥ ⋯ ≥ σ_M ≥ 0 are the singular values of X. The inverse of X may be written as

\[ X^{-1} = V \Sigma^{-1} U' \tag{7} \]

Since X is ill-conditioned, some of its singular values are very close to zero. This leads to extremely large elements in Σ^{-1}, inflating the effect of noise and creating an unstable inversion. An approximate inverse which is well conditioned may be found by applying the following truncation.

\[ X^{\dagger} = V \Sigma^{\dagger} U', \qquad \Sigma^{\dagger}_{ii} = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \ge t \\ 0 & \text{otherwise} \end{cases} \tag{8} \]

Ideally, the choice of the bandwidth t would be data driven. However, the dependence within the sample rules out choosing t through cross-validation or bootstrap methods, and the properties of the noise violate the assumptions for SURE. Thus we choose t heuristically. Noting that we obtain a consistent estimator only if t → 0, we have found reasonable success choosing t = n^{-1/2} for sampling rates less than 20% and otherwise choosing t = n^{-1/k}, where k is the tens digit of the sampling rate (e.g., k = 3 for 30%). We employ the following consistent estimator.

\[ \hat{\beta} = (X^{\dagger} y)_+ \tag{9} \]

where (x)_+ := max{x, 0} imposes non-negativity. Figure 2 illustrates the SVD estimator for a sample taken from an Erdős-Rényi random graph.

3. Monte Carlo Confidence Intervals.

Suppose we wish to construct a 1 − α simultaneous confidence region for β.
Placing confidence intervals around our density estimate typically requires either knowledge of the sampling distribution of the estimator together with a consistent estimator of the standard error, or the use of the non-parametric bootstrap. Neither is possible here since the standard errors depend on nuisance parameters and the observations are not independent.

We propose a Monte Carlo method for constructing confidence intervals akin to a parametric bootstrap. Let β̂ be a consistent estimator of β. Let 𝒢 be the set of all graphs on ‖β̂‖_1 nodes such that each g ∈ 𝒢 has degree distribution β̂.

Algorithm 1 Confidence Bands
for i = 1 to b do
    Sample a graph g_i ∈ 𝒢 uniformly at random
    for j = 1 to c do
        Sample an induced subgraph, h_j, on n nodes from g_i
        Determine the empirical degree distribution, y*_j, of h_j
        Let β̂*_j = (X† y*_j)_+
    end for
    Let L_i and U_i be the α/2 and 1 − α/2 quantiles of β̂* respectively
end for
Let L = min{L_i} and U = max{U_i}

where the quantiles, minima, and maxima are understood componentwise.

Here the reason for choosing a truncated SVD estimator becomes apparent. The pseudoinverse X† may be precalculated such that estimation of β̂* reduces to matrix multiplication. A constrained least squares approach would require the solution of a quadratic programming problem at each iteration.

This approach is similar to a parametric bootstrap because we attempt to mimic the sampling distribution of β̂ by first estimating the parameter and then drawing pseudosamples from a DGP indexed by that parameter. However, the DGP is not uniquely pinned down by the degree distribution alone, since it also depends on higher order properties of the graph. We therefore sample uniformly over all possible graphs which have that degree distribution and in effect consider the worst case scenario. This should lead to confidence intervals which are asymptotically conservative.
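The computational point can be made concrete with a short Python sketch: the truncated pseudoinverse of equation (8) is formed once, every pseudosample is then mapped to an estimate by a single matrix product, and the bands follow from per-graph quantiles. This is a minimal sketch under stated assumptions, not the paper's implementation. The function names, the small values of N, n, M, the threshold t = 0.1, and the Dirichlet draws standing in for the pseudosample degree distributions y* are all hypothetical; the graph sampling and induced subgraph sampling steps are taken as given.

```python
from math import comb
import numpy as np

def design_matrix(N, n, M):
    """Entries X_ij of equation (2) for i, j = 0, ..., M (assumes M <= n - 1)."""
    X = np.zeros((M + 1, M + 1))
    for j in range(M + 1):
        for i in range(j + 1):                       # X_ij = 0 whenever i > j
            X[i, j] = comb(j, i) * comb(N - 1 - j, n - 1 - i) / comb(N - 1, n - 1)
    return X

def truncated_pinv(X, t):
    """Equation (8): invert singular values >= t and zero out the rest."""
    U, s, Vt = np.linalg.svd(X)
    s_inv = np.zeros_like(s)
    keep = s >= t
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ np.diag(s_inv) @ U.T

def confidence_bands(X_dag, y_star, alpha):
    """y_star has shape (b, c, M + 1): c pseudosamples from each of b graphs."""
    est = np.maximum(y_star @ X_dag.T, 0.0)          # (X† y*)_+ for every pseudosample
    lower_i = np.quantile(est, alpha / 2, axis=1)    # per-graph quantiles, shape (b, M + 1)
    upper_i = np.quantile(est, 1 - alpha / 2, axis=1)
    return lower_i.min(axis=0), upper_i.max(axis=0)  # componentwise min and max over graphs

N, n, M = 50, 20, 5                                  # illustrative sizes only
X = design_matrix(N, n, M)
X_dag = truncated_pinv(X, t=0.1)                     # arbitrary illustrative threshold

rng = np.random.default_rng(0)
y_star = rng.dirichlet(np.ones(M + 1), size=(4, 30)) # placeholder data, NOT real pseudosamples
lower, upper = confidence_bands(X_dag, y_star, alpha=0.05)
print(lower)
print(upper)
```

Because `X_dag` is computed outside the loops, the cost per pseudosample is one matrix-vector product, which is the contrast with solving a quadratic program at each iteration.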
This construction requires the ability to sample uniformly over the set of graphs with a given degree distribution. This can be achieved quite easily with the so-called configuration model (Bender and Canfield 1978, Bollobás 2001), which reduces the problem of randomly generating a graph with a prescribed degree distribution to that of creating a random matching.

3.1. Sampling Graphs with a Prescribed Degree Distribution.

Let d_1, . . . , d_N be the sequence of node degrees implied by the degree distribution.² We wish to construct a graph with this degree sequence. Begin with a set of nodes, i = 1, . . . , N. Endow each node i with a set of d_i stubs (or half-links) emanating from the node. Now construct a random matching on the set of stubs, and connect each pair of stubs that are matched to form a link.

To randomly match the stubs, create a list of the node labels in which label i appears d_i times, then form a random permutation of the list. To construct the graph, start at the first entry in the permuted list and pair off the stubs two at a time, linking the nodes whose labels occupy adjacent positions.

² Which is unique up to a relabeling of the nodes.

Example 1. Suppose we wish to construct a graph with degree sequence 4, 2, 2, 1, 1.

[Fig 3: Nodes with 4, 2, 2, 1, and 1 stubs respectively.]

Construct the list: 1111223345. Produce a random permutation: 5113212314. Connect adjacent nodes.

[Figure: (a) Connect the stubs. (b) Resulting Graph.]

Given a degree sequence, d_1, . . . , d_m, we randomly sample the adjacency matrix of a graph g ∈ 𝒢 on m nodes as follows.

Algorithm 2 Sample a Graph
Let S be a vector of stub labels in which label i appears d_i times
G ← zeros(m, m)
P ← randperm(S)
for i = 1, 3, 5, . . . , length(P) − 1 do
    G(P(i), P(i + 1)) ← 1
    G(P(i + 1), P(i)) ← 1
end for

This yields an adjacency matrix G from which we may sample induced subgraphs.
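The stub-matching procedure can be sketched in Python as follows. This is a minimal illustration, assuming NumPy is available; the function name is hypothetical and node labels are 0-based rather than 1-based. It permutes the list of stub labels and pairs consecutive entries, as in the worked example with degree sequence 4, 2, 2, 1, 1.

```python
import numpy as np

def sample_configuration_graph(degrees, rng):
    """Sample an adjacency matrix by random stub matching (configuration model)."""
    degrees = np.asarray(degrees)
    if degrees.sum() % 2:
        raise ValueError("the degrees must sum to an even number")
    m = len(degrees)
    stubs = np.repeat(np.arange(m), degrees)  # node label i appears d_i times
    perm = rng.permutation(stubs)             # random permutation of the stub list
    adj = np.zeros((m, m), dtype=int)
    # Pair off consecutive entries: positions (0, 1), (2, 3), ...
    for a, b in zip(perm[0::2], perm[1::2]):
        adj[a, b] = 1                         # a repeated pair collapses to one entry
        adj[b, a] = 1                         # a self-loop (a == b) lands on the diagonal
    return adj

degrees = [4, 2, 2, 1, 1]
adj = sample_configuration_graph(degrees, np.random.default_rng(0))
print(adj)
```

Because repeated pairs collapse and a self-loop uses two stubs but fills one diagonal entry, a row sum of the returned matrix can fall below the prescribed degree.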
Under this sampling method it is possible for the sampled graph to have self-loops and multiple edges between nodes, though the probability of this declines as the number of nodes grows. It is feasible to accept only samples that are simple graphs, but for the sake of computational speed we will not do so.

Figure 5 illustrates confidence regions generated by this procedure for various sampling rates. Sampling rates for graphs may be as high as 50% (Jackson et al. 2012, Banerjee et al. 2012). Due to the way in which the confidence intervals are generated, they need not be centered around the density estimate. However, since the threshold parameter for the SVD estimator increases with the sampling rate, the confidence bounds automatically become smoother and more tightly concentrated around the estimate.

[Fig 5: 95% Confidence bands with b = 100 and c = 100. M = 15. Panels: (a) Sampling Rate: 10%, (b) Sampling Rate: 20%, (c) Sampling Rate: 30%, (d) Sampling Rate: 50%.]

4. Discussion.

In this paper we have proposed a method for constructing confidence intervals around a density estimate of a graph’s degree distribution based on Monte Carlo construction of graphs whose degree distributions agree with the estimate. For these confidence intervals to attain their advertised coverage in finite samples, it is necessary for the density estimator to have good finite sample estimation risk. Zhang et al. (2015) propose an estimator that attempts to minimize estimation risk both through constraint (enforcing non-negativity and a known ℓ1 norm) and Monte Carlo SURE. However, the combination of estimation through quadratic programming, determining the tuning parameter through Monte Carlo SURE, and forming confidence intervals with Monte Carlo construction of graphs proves to be a computational burden. Consequently we rely on a simpler estimation method based on a truncated singular value decomposition where the bandwidth is chosen heuristically.
A data-driven choice for this bandwidth given a dependent sample is a possible avenue for further research.

References.

Albert, R., Jeong, H., and Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406(6794):378–382.
Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2012). The diffusion of microfinance. NBER Working Papers 17743, National Bureau of Economic Research, Inc.
Bender, E. A. and Canfield, E. (1978). The asymptotic number of labeled graphs with given degree sequences. Journal of Combinatorial Theory, Series A, 24(3):296–307.
Bollobás, B. (2001). Random Graphs. Cambridge University Press, Cambridge, 2nd edition.
Boss, M., Elsinger, H., Summer, M., and Thurner, S. (2004). Network topology of the interbank market. Quantitative Finance, 4(6):677–684.
Bramoulle, Y. and Kranton, R. (2007). Risk-sharing networks. Journal of Economic Behavior & Organization, 64(3-4):275–294.
Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306.
Doyle, J. C., Alderson, D. L., Li, L., Low, S., Roughan, M., Shalunov, S., Tanaka, R., and Willinger, W. (2005). The robust yet fragile nature of the internet. Proceedings of the National Academy of Sciences of the United States of America, 102(41):14497–14502.
Frank, O. (1980). Estimation of the number of vertices of different degrees in a graph. Journal of Statistical Planning and Inference, 40(1):45–50.
Fuchs, J.-J. (2005). Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608.
Gai, P., Haldane, A., and Kapadia, S. (2011). Complexity, concentration and contagion. Journal of Monetary Economics, 58(5):453–470. Carnegie-Rochester Conference on Public Policy: Normalizing Central Bank Practice in Light of the Credit Turmoil, 12–13 November 2010.
Gai, P. and Kapadia, S. (2010). Contagion in financial networks. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences.
Greenshtein, E., Ritov, Y., et al. (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988.
Jackson, M. O., Rodriguez-Barraquer, T., and Tan, X. (2012). Social capital and social quilts: Network patterns of favor exchange. American Economic Review, 102(5):1857–97.
Kranton, R. E. and Minehart, D. F. (2001). A Theory of Buyer-Seller Networks. American Economic Review, 91(3):485–508.
Nier, E., Yang, J., Yorulmazer, T., and Alentorn, A. (2007). Network models and financial stability. Journal of Economic Dynamics and Control, 31(6):2033–2060. Tenth Workshop on Economic Heterogeneous Interacting Agents, WEHIA 2005.
Pellegrini, M., Haynor, D., and Johnson, J. M. (2004). Protein interaction networks. Expert Review of Proteomics, 1(2):239–249.
Zhang, Y., Kolaczyk, E. D., and Spencer, B. D. (2015). Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Ann. Appl. Stat., 9(1):166–199.
Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563.